Abstract


The asynchronous approach to realising digital systems has been known for as
long as the synchronous approach. However, the synchronous approach has been the
preferred choice because of its simplicity. With advances in VLSI device and
fabrication technology resulting in high levels of integration, the synchronous
approach has to deal with the problems of critical path delay, clock skew and
increased power dissipation. Besides, technology migration is difficult in the
synchronous approach.

These problems can be addressed by adopting the asynchronous approach. This is
why the approach has seen a resurgence, especially in the domain of mobile
communications and handheld applications.

Asynchronous design approaches are distinguished primarily by the delay models
they use. In the present thesis, we develop a new design methodology based on the
delay insensitive model for asynchronous circuits, which uniformly uses 2-phase
non-return-to-zero transition signalling. We first develop a library of basic
modules based on this approach. We also show, through illustrative examples, how
designs can be implemented using these elements. To synthesize designs from
their behavioral descriptions in a Hardware Description Language, we need to
include additional interconnect elements for point to point and bus
interconnection topologies. We study a few of these elements. Our approach is
not amenable to synthesis based on the algorithms and approaches employed in the
synchronous design paradigm. As such, in the latter part of the thesis, we hand
synthesize a few designs from their HDL descriptions to study the various
synthesis issues applicable to our approach.
Chapter 1
Introduction 


Digital systems can be realised using either of two broad classes of circuits,
synchronous and asynchronous. Designs based on the synchronous paradigm have
been very widely used. In synchronous systems, a clock coordinates all
activities. The clock is distributed to the various parts of the system. When the
subsystems are widely distributed on a chip, their synchronization can be very
difficult because of the delays associated with the interconnection wires. This
problem is known as clock skew. Furthermore, each individual circuit block needs
to be designed using worst case delays, to ensure that the worst case delay in a
circuit block is less than a clock period. This leads to a very conservative
design based on worst case performance. As device sizes reduce and their speed
of operation increases, the delays contributed by the interconnecting wires
become dominant. This worsens the clock skew problem even further.

Asynchronous systems, on the other hand, do not use any clock for coordinating
their internal operations. This completely eliminates the problem of clock
skew. It also results in designs having average case performance instead of
worst case performance. In asynchronous systems, active signals are confined to
the vicinity of the elements involved in carrying out a computation, unlike in
synchronous systems, where transitions arising from the clock signal take place
even in idle elements. This results in lower power consumption for asynchronous
systems. Also, individual circuit blocks can be designed separately without
having to satisfy any global timing constraint. This eases the timing issues,
and rarely used portions of the circuit can be left unoptimised without
hampering the overall performance to a great extent.

However, asynchronous circuits are more difficult to design than their
synchronous counterparts. Moreover, the circuit implementations can be quite
complex and are generally cumbersome. Therefore, there is a need for automating
the design of digital systems based on the asynchronous approach. Many
methodologies exist for the design of asynchronous systems [1]. All of them use
an underlying delay model and can be classified on the basis of the delay model
they use. The major delay models are as follows:

• Bounded delay model 

• Delay Insensitive model 

• Quasi-delay-insensitive model 

• Speed independent model 

The bounded delay model, also called Huffman’s model, assumes that the delays
in all the circuit elements and wires are known, or at least bounded. While
designing circuits using this model, extreme care has to be exercised to avoid
hazards [2, 3]. One of the ways to avoid hazards is not to allow multiple input
changes. Hazards due to single input changes are eliminated by adding redundant
circuit blocks [3]. However, this degrades the final performance of an
implementation based on this model. None of the methodologies available for this
delay model addresses system design issues. Hence, synthesis methods based on
this model have not evolved.

The delay insensitive model assumes that the delays in both the circuit
elements and the wires are finite but unbounded. Because this delay model
imposes the fewest restrictions on delay assumptions, circuit design based on
this model is very attractive. But it has been found that the class of purely
delay insensitive circuits realised using classical circuit elements like AND,
OR, etc. is extremely limited [4].
In the speed independent model, it is assumed that while the gate delays are
unbounded, the wire delays are negligible. The quasi-delay-insensitive model is
based on the delay insensitive model, where both the gate and wire delays are
unbounded. However, it restricts the wires connecting fanout elements to have
the same delay. This is known as the isochronic fork. Thus isochronic forks are
forking wires where the delays in all the forking wires are nearly identical. The
quasi-delay-insensitive model can be seen to satisfy the assumptions made in the
speed independent model.

Compared to the delay insensitive model, the speed independent and the
quasi-delay-insensitive models offer more implementation alternatives. But their
delay assumptions are difficult to realise. The delay assumption in speed
independent circuits is no longer valid for all technologies, e.g., FPGAs, where
wire delays often dominate. It is also not valid for large systems. Also,
isochronic forks in quasi-delay-insensitive circuits can be difficult to
realise, especially when the forking ends are on different chips. Considering
this, circuit design under the delay insensitive model is the most attractive
option.

Ebergen proposed a synthesis method for designing delay insensitive circuits
which is based on trace theory [5]. It uses circuit elements like the C element,
toggle, merge, wire, etc. Brzozowski and Ebergen [6] proved that the C element
and toggle blocks cannot have a DI implementation using conventional level
sensitive gates such as AND, OR, NOT, etc. Leung and Li [7] have recently
proposed a set of properties to characterize the DI behaviors of any circuit
element. It has also been conjectured that any basic circuit element used for
realising DI behaviors cannot be delay insensitive internally. So, the design of
delay insensitive systems is possible using modules which are not internally
delay insensitive. One example is the Q-modules designed by Rosenberger et al.
[8].

Though Ebergen’s synthesis methodology provides a good theoretical basis for
the design of delay insensitive circuits, there are problems with its wide
applicability. Trace theory is very difficult for humans to understand, as it
is nonintuitive. Designing circuits using this methodology forces a designer to
think at the level of individual transitions for each new circuit to be
designed. This renders describing the behavior of large systems, such as a
complex microprocessor, very difficult. The motivation for this work arises
from the drawback observed above, and is to synthesize delay insensitive
asynchronous digital systems from their behavioral description in a Hardware
Description Language (HDL). This approach will provide a much simpler and
easier route than the one based on trace theory. Further, we assume the
2-phase, non-return-to-zero (NRZ) event driven scheme adopted in [9, 10]. In
this scheme, data signals are 2-rail. A data value of ‘1’ is indicated by a
transition on one of the two rails (r1) and a ‘0’ is indicated by a transition
on the other rail (r0). Transitions, also called events, occurring on both the
rails simultaneously imply invalid data and are not allowed. Control signals,
on the other hand, are single rail, and a transition on a control wire
initiates the control operation associated with it.
NRZ transition signalling is used because it gives better performance than
RZ transition signalling. RZ signalling is used in the TITAC work [11], which is
a quasi-delay-insensitive microprocessor. Consider a register to register data
transfer in this microprocessor, as shown in Figure 1.1. Let the data stored in
Reg A and Reg B be transferred to a functional unit FU. The output of FU is then
to be transferred to Reg C. To carry out this task, the controller makes its
Request signal logic high. This transfers the data in Reg A and Reg B to FU, and
its output is transferred to Reg C. After Reg C receives the data, the Ack wire
is made high to signify the end of the operation. This constitutes the working
phase of the operation, which is further subdivided into the working transient
(WT) and working stable (WS) subphases, as shown in the figure. After this, the
controller pulls its Request signal low. This, in turn, pulls all the signals in
the circuit low, including Ack. This constitutes the idle phase, which is again
subdivided into the idle transient (IT) and idle stable (IS) subphases. The
presence of an idle phase degrades the performance.
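
The 2-rail NRZ scheme described above can be illustrated with a minimal sketch.
Python is used here purely for illustration; the function names, the initial
rail levels of 0 and the figure of two rail transitions per bit for the RZ case
(one to signal the value, one to return to the idle level) are assumptions of
this sketch and are not taken from [9, 10] or [11].

```python
def encode_nrz(bits, r1=0, r0=0):
    """Return the successive (r1, r0) rail levels for a bit sequence.

    A 1 toggles rail r1 and a 0 toggles rail r0, so exactly one rail
    changes per data item and no rail ever has to return to zero.
    """
    levels = []
    for bit in bits:
        if bit not in (0, 1):
            raise ValueError("data bits must be 0 or 1")
        if bit == 1:
            r1 ^= 1
        else:
            r0 ^= 1
        levels.append((r1, r0))
    return levels


def decode_nrz(levels, r1=0, r0=0):
    """Recover the bits by looking at which rail changed at each step."""
    bits = []
    prev = (r1, r0)
    for cur in levels:
        changed1 = cur[0] != prev[0]
        changed0 = cur[1] != prev[1]
        if changed1 == changed0:
            # Either no event, or events on both rails: invalid data.
            raise ValueError("exactly one rail must change per data item")
        bits.append(1 if changed1 else 0)
        prev = cur
    return bits


def data_rail_transitions(num_bits, return_to_zero):
    """Rail transitions needed per bit: 1 for NRZ, 2 for RZ (assumed)."""
    return num_bits * (2 if return_to_zero else 1)
```

Under these assumptions, sending four bits costs four rail transitions with NRZ
signalling and eight with RZ signalling, which is one way to see the cost of
the idle phase.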

The abovementioned task of designing and synthesizing asynchronous digital
systems necessitates a library of basic modules designed under the DI model
assumption. Thus we need modules which can perform basic logic functions such as
AND, OR, XOR, etc. in the adopted signalling protocol. Nanda et al. [9, 10] use
a U-gate to realise all the basic Boolean expressions. In the same work, a Shift
Multiplier is designed using U-gates and other circuit blocks. Conversions
between logic levels and transition signals, and vice versa, are done with 1to2
Converters and 2to1 Converters respectively. However, there is no mention of a
consistent way to design and synthesize asynchronous systems. Instead, an
approach based on the equivalent synthesized synchronous design has been given.
This can lead to non-optimal implementations, as will be shown later.

Figure 1.1: a) Register to register event driven data transfer b) Two phase operation

Chapter 2 describes the design of the basic blocks which are intended to be
present in the library used for synthesis. In chapter 3, a few design examples
are illustrated which use the basic blocks described in chapter 2.
Interconnection between the various resources is described in chapter 4. Two
methods of interconnection are considered: point to point interconnection and
the bus structure. Three approaches to implementing the bus structure are
described. In chapter 5, various synthesis issues are presented, where we see
that the synthesis issues are dominated by the interconnection topology used.
Three interconnection elements for the point to point topology are proposed,
which result in a better implementation of the interconnection network. Two
design examples, a Shift Multiplier and a differential equation integrator, are
presented to justify the synthesis process outlined earlier. Chapter 6 concludes
the thesis and provides pointers to some issues which could not be addressed in
the present thesis.
Chapter 2
Basic Modules 


Any implementation of digital systems requires a library of basic circuit
elements to realise complex Boolean functions. Resources to store data and to
carry out data transfers between them are needed very often. The controller,
which coordinates the various activities in the system, is also realised using
the same library of circuit elements. We therefore need to create this set of
basic circuit elements to define the above library. In the design methodology
adopted, logic functions are realised using U gates [9, 10]. NRZ asynchronous
data can be stored in either the asynchronous register called UReg or its
variation, REG1. However, UReg and REG1 are not the only storage elements.
Several other storage elements, such as A_Demux_Store, Const_Demux_Store and
Iter_Var_Store, which are more suitable for particular data types, have also
been used. The controllers have been implemented primarily using C elements
[12], Select blocks [9, 10], XOR gates and Event Counters.

This chapter starts with the description of UReg and REG1. It is followed by
the Event Counters. Asynchronous toggle flipflops are described next. Finally, a
way to ensure a known initial condition in the circuit elements is illustrated.
2.1 Asynchronous Register (UReg)

The objective of designing the UReg is to store and transfer out asynchronous
data in the NRZ transition signalling format.

Figure 2.1 shows the functional representation and circuit diagram of the UReg.
It has a 2-rail input IN, control signals Datastore_c and Data_out_c, and 2-rail
outputs Datastore and Data_out, indicated by {DS1, DS0} and {D1, D0}
respectively.



Figure 2.1: Asynchronous Register UReg


Input data is applied at IN. Datastore_c controls the availability of the input
at the Datastore output terminals. The register also retains the present input,
so that the next application of Datastore_c will generate the same output on the
Datastore terminals. Data_out_c, on the other hand, controls the availability of
the input at the Data_out terminals, after which a new input can be applied to
the UReg.
The circuit can be divided into two blocks, namely the data store block and the
data shift block, both of which receive the input. The data store block consists
of Muller C elements C2 and C3 with XOR gates connected to them. Feedback from
the data shift block is also connected to the XOR gate inputs in this block. The
other signals in this block are the control signal Datastore_c and the
corresponding 2-rail output signal Datastore. Muller C elements C0 and C1 and
the XOR gates connected to them, along with the Data_out_c and Data_out
terminals, constitute the data shift block.

Circuit Operation 

Suppose bit 1 is applied at the input, i.e., a transition occurs on IN1.
Therefore, one input each of C1 and C2 sees a transition. Now the application of
Datastore_c will place transitions on the inputs of C2 and C3. As both the
inputs of C2 have received a transition, C2 will fire to produce an output
transition at DS1. Thus, the application of Datastore_c reproduces the input on
the Datastore terminals. The transition on D1 is fed back to C2, in order to be
able to store the data. The same transition is used to cancel the predeposited
transition on the input of C3. Thus the UReg is returned to the condition that
was present before the application of the Datastore_c signal with respect to
the given input. A new Datastore_c can be applied to induce an identical set of
events, producing a new transition on DS1.

To transfer out the stored data corresponding to the above input, Data_out_c is
applied. The transition on Data_out_c is deposited on one of the inputs of C0
and C1. C1 will produce a transition on D1, indicating that the data has been
shifted out. Also, the same transition cancels the predeposited Data_out_c
transition on C0 and the transition corresponding to the input data on the input
of C2 in the data store block. With this, no transition remains in the UReg,
which is then ready to receive the next data bit.

Simultaneous changes in the input IN and one of the control signals are valid
only when no transition exists in the UReg, i.e., while using the UReg for the
first time and after each application of Data_out_c. Of course, only one input
rail can have a transition at a time, and simultaneous occurrences of
transitions on both the control signals are prohibited. No new data can be
applied before the previous data has been shifted out of the UReg. Also,
successive applications of Datastore_c should be separated by the intervening
occurrences of Datastore, to fulfill the “cause-reaction” relationship which
characterizes the delay insensitive way of functioning.
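
The usage rules above can be summarised in a small protocol-level model. This
is only a sketch: it does not model the C elements and XOR gates of Figure 2.1,
and the class name `URegModel`, its method names and the convention of
returning the name of the rail that receives a transition are assumptions made
for illustration.

```python
class URegModel:
    """Protocol-level model of a UReg (hypothetical, not gate level)."""

    def __init__(self):
        self.stored = None  # None: no transition is held in the register
        self.rails = {"DS1": 0, "DS0": 0, "D1": 0, "D0": 0}

    def apply_input(self, bit):
        """Apply a transition on IN1 (bit 1) or IN0 (bit 0)."""
        if bit not in (0, 1):
            raise ValueError("data bits must be 0 or 1")
        if self.stored is not None:
            raise RuntimeError("previous data has not been shifted out")
        self.stored = bit

    def datastore_c(self):
        """Reproduce the stored bit on Datastore; the bit is retained."""
        if self.stored is None:
            raise RuntimeError("no data in the register")
        rail = "DS%d" % self.stored
        self.rails[rail] ^= 1  # an event is a level change on the rail
        return rail

    def data_out_c(self):
        """Shift the stored bit out on Data_out; the register empties."""
        if self.stored is None:
            raise RuntimeError("no data in the register")
        rail = "D%d" % self.stored
        self.rails[rail] ^= 1
        self.stored = None
        return rail
```

In this model, repeated calls to `datastore_c` keep producing events on the
same Datastore rail, while a call to `data_out_c` is required before the next
`apply_input` is accepted.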


2.1.1 REG1


A register REG1 has a 2-rail input Inp, two control signals Dt_out_c and Clr,
and a 2-rail output R_out. It also has a ClrAck output. A transition on the Clr
input terminal moves the old data out. The completion of this is indicated by a
transition on the ClrAck output. A transition on the Dt_out_c control input, on
the other hand, produces the data in the register on the R_out terminals. The
data is retained by the register in this process. The register is designed such
that on global reset it will be loaded with a value of 0.





A REG1 is realised using a UReg as shown in Figure 2.2. The inverter at its
input ensures the presence of a data bit 0 on the application of a global reset.
A transition on the Clr input terminal is applied to the Data_out_c input
terminal of the UReg, and this shifts the contents of the UReg out on its
Data_out terminals. The XOR gate converts this data to an acknowledge signal
ClrAck, signifying the completion of the operation. It can be seen that a
transition on the Dt_out_c input terminal of the REG1 produces the stored data
on the R_out terminals, and that the data is also retained.
2.2 Event Counters


A modulo n Event Counter has one single rail input Eventin and n single rail
outputs O1 through On. The ith input transition (i ≤ n) creates a transition on
output Oi. After the nth input transition, the cycle repeats.
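
The external behaviour just described can be captured in a few lines. The
following is a behavioural sketch only, not a model of any of the circuit
implementations; the class name `EventCounter` and the choice to represent
each output by its current logic level are assumptions made for illustration.

```python
class EventCounter:
    """Behavioural model of a modulo n Event Counter."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.count = 0          # number of events seen on Eventin
        self.outputs = [0] * n  # current levels of O1 .. On

    def event_in(self):
        """Apply one transition on Eventin.

        Returns the 1-based index i of the output Oi that receives
        a transition in response.
        """
        i = self.count % self.n
        self.outputs[i] ^= 1    # a transition is a level change
        self.count += 1
        return i + 1
```

For a modulo 3 counter, seven successive input events produce transitions on
O1, O2, O3, O1, O2, O3 and O1 in that order.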

One possible implementation is shown in Figure 2.3. After a global reset, the
inverters cause an initial transition to be deposited on those inputs of the C
elements to which they are connected.



Figure 2.3: Event Counter Implementation 1


Each odd numbered input transition on Eventin will place a transition on one
input of each odd numbered C element, viz., C1, C3, etc. It will also cancel the
input transition on the even numbered C elements, viz., C2 and C4, connected to
Eventin through the inverters. Similarly, each even numbered transition on
Eventin will place a transition on an input of the even numbered C elements and
will cancel the predeposited transition on the corresponding inputs of the odd
numbered C elements. Thus, at any given instant, only a single C element has a
transition on the input which is not connected to Eventin.


Delay from Eventin to input of C1 through O3
>
Delay from Eventin to input of C1 through Inv1 and Inv2

Delay from Temp to input of C1 through D
>
Delay from Temp to input of C1 through XOR gate, Inv1 and Inv2


Figure 2.4: Event Counter Implementation 2

Initially, C1 has a transition on one of its inputs. The first transition on
Eventin creates a transition on output O1. This transition is issued to the
input of C2. As the transition on the other input of C2 has been cancelled by
the previous transition on Eventin, a possible hazardous event on O2 is avoided.
The next transition on Eventin will enable C2 to create an event on O2. The
transition on the nth output is fed back to the input of C1, so that a
subsequent transition on Eventin will create a transition on O1.

Between any two successive transitions at the output of any C element, it
receives n transitions on its input connected to Eventin. Out of these n
transitions, the first is utilized to create an event on the output of the
corresponding C element and is thus consumed. The remaining n - 1 transitions
should not result in any net transition on this input of the C element, so they
must cancel in pairs. Thus n is forced to be an odd number.
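
One way to read the parity argument above is as a toggle on the Eventin-side
input of a C element: each unconsumed transition alternately deposits and
cancels a pending event. The sketch below is an illustration of that reading
only; the function name and the boolean `pending` state are assumptions and do
not correspond to a named signal in the circuit.

```python
def stray_transition_left(n):
    """Check the Eventin-side input of one C element over a full cycle.

    Of the n transitions received between two firings, the first is
    consumed by the firing itself. Each of the remaining n - 1 either
    deposits a pending transition or cancels the one already there.
    Returns True if a pending transition is left over at the end.
    """
    pending = False
    for _ in range(n - 1):
        pending = not pending  # deposit, cancel, deposit, ...
    return pending
```

The function returns False only for odd n, which matches the condition that n
must be odd for the basic circuit.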

Hence, if the number of outputs is even, the circuit needs to be modified so
that the above condition is satisfied. Such a modified circuit for four outputs
is shown in Figure 2.3. An intermediate output, indicated by Temp, is used to
create an extra transition at the input through an XOR gate. The nth (even)
transition creates a transition on Temp, which is fed back to create an
additional event, fulfilling the condition that the input should receive an odd
number of transitions. This in turn produces the final output transition.

Alternatively, Temp can be used as O4, eliminating the use of C4. However, this
imposes the following delay constraint: the delay from Temp to the input of C1
connected to it through an inverter should be greater than that from Temp to the
other input of C1 through an XOR gate.

Let a transition on Eventin create a transition on the output of any C element
Ci. This transition on Eventin must cancel the transition on the input of Ci+1
connected to Eventin before Ci+1 receives a transition on its second input,
generated at the output of Ci. Thus the delay in the path from Eventin to the
input of Ci+1 through output Oi should be greater than that in the path from
Eventin to the other input of Ci+1.

All these local delay constraints can be avoided by modifying the circuit such
that only one delay constraint needs to be satisfied. We ensure this as follows.
A transition to the input of C element Ci connected to Eventin is issued only
after the transition on the input of Ci+1 connected to Eventin is cancelled. The
modified circuit is shown in Figure 2.4.

The abovementioned condition is violated for Cn, as a transition to its input
connected to Eventin can be issued before the respective transition on C1 is
cancelled. This is corrected by the delay element D, satisfying the delay
constraint indicated in the figure.

The delay constraints for the realisation of the Event Counters of Figure 2.3
and Figure 2.4 can be eliminated by using a variation of the Muller C element.
This element can be realised as shown in Figure 2.5. Assuming a delay constraint
such that the delay in the lower input path is less than that in the upper path,
this block can be rendered into a Delay Insensitive (DI) block. Thus, the event
counter can have a quasi-delay-insensitive realisation. This is shown in Figure
2.5.


Figure 2.5: Event Counter Quasi-delay-insensitive realisation


The complexity of each of the structures discussed above is linear in the
number of outputs required in a mod n counter. For every increase in n by 1, at
most one C element and one inverter are needed.



Figure 2.6: Mod-18 Event Counter


If not all n outputs of a modulo n Event Counter are needed, the circuit can be
made compact by realising it using Event Counters with a smaller number of
outputs. For example, a modulo 18 counter is realised using a single modulo 2
and two modulo 3 counters, as shown in Figure 2.6. Counter A, which is a modulo
3 counter, responds to every event on its input, Eventin. It recycles after
three consecutive events on Eventin. Thus, output O3 of Counter A shows a
transition on the third, sixth, ninth event on Eventin, and so on. This output
is connected as an input to Counter B, which is also a modulo 3 counter.
Therefore, output O9 of this counter will respond to every ninth transition on
Eventin. It can be seen that O18 responds to every eighteenth transition on
Eventin. The response of the other outputs can be easily verified.
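
The cascade described above can be checked with a short simulation. This is a
sketch under assumptions: the helper class `ModCounter`, the function
`cascade_mod18` and the choice of the last output of each stage as the one fed
to the next stage are introduced here for illustration, following the
description of Counter A and Counter B.

```python
class ModCounter:
    """Minimal modulo n counter that reports when its last output fires."""

    def __init__(self, n):
        self.n = n
        self.count = 0

    def event(self):
        """Apply one input event; return True when output On fires."""
        self.count = (self.count + 1) % self.n
        return self.count == 0


def cascade_mod18(num_events):
    """Feed num_events events into a mod 3 -> mod 3 -> mod 2 cascade.

    Returns two lists with the (1-based) Eventin event numbers at which
    the second stage and the third stage produce their last output.
    """
    counter_a, counter_b, counter_c = ModCounter(3), ModCounter(3), ModCounter(2)
    ninth, eighteenth = [], []
    for k in range(1, num_events + 1):
        if counter_a.event():          # every 3rd event on Eventin
            if counter_b.event():      # every 9th event on Eventin
                ninth.append(k)
                if counter_c.event():  # every 18th event on Eventin
                    eighteenth.append(k)
    return ninth, eighteenth
```

The cascade uses 3 + 3 + 2 = 8 stage outputs in place of the 18 outputs a
single modulo 18 counter would need.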


2.3 Asynchronous Conditional Toggle FlipFlop (A_CTff)

The objective of an A_CTff is to have an asynchronous counterpart of the
synchronous toggle flipflop. The functional representation and circuit
realisation are shown in Figure 2.7. It has two double rail inputs, T and Init,
and two single rail control signals, Next_state and Clear. It also has two
double rail outputs, Q0 and Cascade.

Clear is used to remove a data bit residing in the flipflop in the form of a
transition. The completion of a clear operation is indicated by a transition on
ClrAck. Depending on the binary value present on the T input, a transition on
Next_state toggles the current contents of the flipflop or retains them. After
the application of a global reset or a Clear, no transition exists in the
flipflop. Initialization inputs can be applied to the Init terminals only after
the flipflop has attained the above state. Data on the Init input is immediately
made available on the Q0 output.

Now, for every transition on the Next_state input, if the input T is 1, the
existing bit in the flipflop is toggled and the relevant transition is produced
on the Q0 terminals. On the other hand, if T is 0, the same data bit is produced
on the Q0 terminals. The two rail signal Cascade is used to cascade the
flipflops in applications such as the two rail counter described in the
following chapter.

The A_CTff shown in Figure 2.7 is realised using two U gates and 5 XOR gates. A
data bit on the Init input is stored in U2. Application of a transition on the
Clear terminal transfers the stored data to the Clr output of U2. This is the
process by which any stored data is removed from the flipflop. To signify the
completion of the removal of data from the flipflop, we generate a ClrAck using
an XOR gate.

Figure 2.7: Asynchronous Conditional Toggle FlipFlop

Application of a transition on Next_state, on the other hand, moves the data bit
to the Cascade terminals. These terminals are fed back as an input to U1. Assume
T is 1. Then the initial data bit stored in U2, which is also present on the
Cascade terminals following an event on the Next_state input terminal, is
shifted to the Toggle terminals of U1. If T is 0, it is instead shifted to the
Pass terminals of U1. The Toggle and Pass terminals are connected along with the
Init inputs such that a data bit produced on the Toggle terminals is inverted at
Q0, while one produced on the Pass terminals goes through unchanged.

The processed data with respect to T is available at Q0. It is also stored in
U2. Thus we see that, in response to an event on the Next_state input, the
stored data is moved out of U2 through the Cascade terminals. Depending on the
value of T, it is processed by U1 and the processed data bit is fed back to U2.
This constitutes a loop of events for which there exists a possibility of
hazards. In fact, we can see that in the above implementation there are no
hazards, because the processed bit received by U2 is generated using the one
which was stored earlier. This temporal dependence of the new event on an
earlier event precludes the generation of any hazard. After this, the flipflop
is ready to receive the next control signal. The necessary acknowledge to the
controller can be generated using the Q0 terminals.

For the proper functioning of the A_CTff, the following are implicit:

1. The control signals Next_state and Clear cannot be applied simultaneously.

2. Next_state or Clear can be applied only if data is residing in the flipflop.

3. Each occurrence of Next_state should be supported by an application of T in
order to complete the desired operation.

4. Reset or Clear should be followed by new data on Init.
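
The behaviour of the A_CTff and the usage rules listed above can be sketched
as a small state machine. This is a behavioural illustration only and does not
model the U gates or XOR gates of Figure 2.7; the class name `ACTffModel`, the
method names and the use of exceptions to flag rule violations are assumptions
of the sketch.

```python
class ACTffModel:
    """Behavioural model of the conditional toggle flipflop (hypothetical)."""

    def __init__(self):
        self.bit = None  # None: no data after a global reset or a Clear

    def init(self, bit):
        """Apply data on Init; allowed only when the flipflop is empty."""
        if bit not in (0, 1):
            raise ValueError("data bits must be 0 or 1")
        if self.bit is not None:
            raise RuntimeError("Init is allowed only after reset or Clear")
        self.bit = bit
        return self.bit  # value made available on Q0

    def next_state(self, t):
        """Apply Next_state together with T: toggle if T is 1, else retain."""
        if t not in (0, 1):
            raise ValueError("T must be 0 or 1")
        if self.bit is None:
            raise RuntimeError("no data residing in the flipflop")
        if t == 1:
            self.bit ^= 1
        return self.bit  # value produced on Q0

    def clear(self):
        """Remove the stored bit; the return value stands for ClrAck."""
        if self.bit is None:
            raise RuntimeError("no data residing in the flipflop")
        self.bit = None
        return "ClrAck"
```

Because each method models one complete event, the rule that Next_state and
Clear are never applied simultaneously is satisfied by construction.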

2.4 Asynchronous Toggle FlipFlop (A_Tff)

Compared to the A_CTff, the A_Tff has the same set of inputs and outputs except
for the absence of the input T. The only functional difference lies in the fact
that, on every transition on the Next_state input, the value of the data stored
in U is toggled by inverting it, and is made available at the Cascade terminals
as shown in Figure 2.8. This implementation uses only a single U gate, which is
functionally equivalent to U2 of the A_CTff.

2.5 Modified 1to2 Converter

The 1to2 converter has a logic level input C. It receives a 1-rail control
signal Read, which is internally referred to as X. It has a 2-rail output
denoted by {Z1, Z0}. A transition on the Read terminal creates a transition on
Z1 if C is 1; else, a transition is created on Z0. The circuit is realised using
two subcircuit blocks BR and BF, shown in Figure 2.9a.

Figure 2.8: Asynchronous Toggle FlipFlop

The block BR creates a transition on its output Z if the input C is high (level
sensitive) and a rising transition is applied to the X terminal. This can be
understood as follows. Let C be high and X be low. The complement of Z is stored
by C1, while C2 retains its previous value. When X goes high, the value of C2
becomes the complement of C1, which is Z. Also, C1 gets isolated from Z since X
is high. This, in turn, creates a transition on Z. In the block BF, a transition
on Z is created for a falling transition on X, provided C has a logic level 1.

The modified 1to2 Converter is realised using these blocks as shown in Figure
2.9b.
Figure 2.9a: Modified 1to2 Converter blocks BR and BF
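
At the terminal level, the converter turns a logic level sampled on each Read
event into an event on one of two output rails. The following sketch models
only that external behaviour and not the BR and BF blocks; the class name
`OneToTwoConverter`, the rail names `Z1` and `Z0` and the initial rail levels
of 0 are assumptions made for illustration.

```python
class OneToTwoConverter:
    """Behavioural model of a level-to-transition (1to2) converter."""

    def __init__(self):
        self.z1 = 0  # current level of the rail signalling a 1
        self.z0 = 0  # current level of the rail signalling a 0

    def read(self, c):
        """Apply one transition on Read while the level input is c.

        Both rising and falling transitions on Read count as events.
        Returns the name of the output rail that receives a transition.
        """
        if c not in (0, 1):
            raise ValueError("C must be a logic level, 0 or 1")
        if c == 1:
            self.z1 ^= 1
            return "Z1"
        self.z0 ^= 1
        return "Z0"
```

Reading the levels 1, 0, 1 in succession produces events on Z1, Z0 and Z1,
which is the 2-rail NRZ encoding of the same bit sequence.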

2.6 Initialization on Reset 

All the circuit elements need to be taken to a known initial state on power on
and on global reset. It is assumed that, on the application of a global reset,
the input and output terminals of all the C elements are set to a logic value
of 0. In order to achieve this, a resettable C element is used. Besides the
input and output terminals, it also has a Reset terminal. An application of a
logic 1 to the Reset terminal forces the output of the C element to take a
logic value of 0. Two possible implementations are shown in Figure 2.10.

As all the basic circuit elements are realised using C elements and XOR gates,
an application of logic zero on the Reset terminal along with the application of
logic zero on all the primary inputs forces the circuit elements to the desired
initial condition.
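
The behaviour of a resettable C element can be stated compactly in code. This
is a level-based behavioural sketch of the standard Muller C element with an
added reset, not a model of either implementation in Figure 2.10; the class and
parameter names are assumptions made for illustration.

```python
class ResettableCElement:
    """Level-based model of a two input Muller C element with Reset."""

    def __init__(self):
        self.out = 0  # assumed initial output level

    def update(self, a, b, reset=0):
        """Evaluate the element for input levels a and b.

        A logic 1 on reset forces the output to 0. Otherwise the
        output follows the inputs when they agree and holds its
        previous value when they differ.
        """
        if reset == 1:
            self.out = 0
        elif a == b:
            self.out = a
        return self.out
```

With Reset at 1 the output is 0 regardless of the inputs, and once Reset is
released with both inputs at 0 the element remains in that known state until
both inputs have changed.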
Chapter 3

Design Examples 


The use of the basic modules described in the last chapter in implementing
asynchronous designs is illustrated through three examples. The first example is
that of a Serial In Parallel Out (SIPO) shift register. This is followed by the
design of a Polynomial Serial Parallel Multiplier (PSPM). Finally, the design of
a 2-rail mod-8 up-down counter based on the A_Tff and A_CTff basic modules is
given.

3.1 Serial In Parallel Out Shift Register 

In many DSP applications, data is applied serially and is processed as it
travels on its way to the output. Designs based on such a paradigm are
illustrated in the next section with the example of a Polynomial Serial Parallel
Multiplier (PSPM). This multiplier employs the Serial In Parallel Out (SIPO)
shift register described below.

The design of the PSPM follows the Micropipeline paradigm introduced by
Sutherland [12]. The only difference in our case is that the data is 2-rail NRZ
event driven, as we assume the delay insensitive model instead of the bundled
data constraint.

The module representation of a 4 bit SIPO asynchronous shift register is shown
in Figure 3.1. It receives a 2-rail input Datain. 2-rail outputs are made
available on Q1 to Q4 by the application of the Data_c control signal. The
control signal Shift_c is used to shift in the applied data, while Shift_out_c
is used to take data out of the last register. Completion of this is
acknowledged on the ShiftOutAck terminal. A transition on ShiftInAck indicates
that a new data bit can be applied at Datain.
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Figure 3 1 Serial In Parallel Out Shift Register 


The realisation consists of two distinct blocks as shown in Figure 3.1. The first
block has four URegs, while the second block is the controller that controls the
data transactions taking place between adjacent URegs. In this realisation,
on application of a global reset, no data is present in the shift register and the
controller is taken to an appropriate state. A state in the controller implies transitions
present at the inputs of the C elements C1, C2 and C3. Due to the inverters connected
to the C elements, on a global reset, the b input of every C element receives a
transition.

The application of a transition on Shift_c will generate a transition on Data_out_c1.
This eventually results in transitions on Data_out_c2 and Data_out_c3. When
UReg1 to UReg3 receive a transition on their Data_out_c terminals, they pass the
value present on their respective inputs to their Data_out terminals. Therefore the
data applied to Datain of the shift register passes through UReg1, UReg2 and UReg3
to finally reach UReg4. The resulting transitions on Data_out2 and Data_out3 are
used to deposit a transition on each of the b inputs of C1 and C2, while the b input
of C3 does not receive any transition. A transition on Data_out1 is converted into
ShiftInAck, indicating that a new data bit can be applied.

The next transition on Shift_c generates only Data_out_c1 and Data_out_c2 and
not Data_out_c3. The newly applied data bit is therefore shifted to UReg3. In a
similar manner, a new data bit can be stored in UReg2. When a new data bit
is to be stored in UReg1, it is not necessary to have a transition on Shift_c. If
Shift_c is nevertheless applied, it remains unconsumed for the final bit stored in
UReg1. This process loads new data in the shift register. The data loaded in the
shift register can be made available on the outputs Q1 to Q4 by each application of
the Data_c signal.

To clear the stored data present in the shift register, the Shift_out_c signal is used.
This drives the data contained in UReg4 to its Data_out4 terminals. This causes a
transition to be deposited on the b input of C3. C3 has a transition waiting on its
a input as a result of the application of a data bit to UReg4 during the loading phase.
This finally results in a transition on Data_out_c3, which causes the data stored
in UReg3 to be shifted to UReg4. Similarly, data in UReg2 is shifted to UReg3 and
that in UReg1 to UReg2. As described earlier, the unconsumed transition on Shift_c
during the loading phase is used up here. At the end of this, no data is contained
in UReg1. Thus, during the clear phase, we see an operation similar to
the shift right operation taking place. However, the actual shift right phase will
involve new data being applied on the Datain terminals along with transitions on the
control signals Shift_c and Shift_out_c. The intermediate generation of ShiftOutAck
enables the application of the next Shift_out_c signal to clear one more bit in the
shift register. Repeated application of Shift_out_c will remove all the data loaded
initially in the shift register.


Figure 3.2: Serial In Parallel Out Shift Register: Parallel loading

Shifting in a new data bit while removing one from UReg4 is possible by simultaneously
applying Shift_c and Shift_out_c along with the application of the new data bit
on the Datain terminals. This can be done for both a partially and a completely filled
shift register.

The above circuit can be modified to enable parallel loading. This is shown in
Figure 3.2. The controller needs to be modified so that it results in the state
corresponding to a fully loaded shift register. This
is done by depositing a transition on the a inputs of C2 and C3. These transitions
can be derived from the initialization input transitions.
Other shift register configurations are also possible with relevant changes in the
controller and URegs.

3.2 Polynomial Serial Parallel Multiplier 

A Serial Parallel Multiplier implements one of the possible ways of carrying out a
multiplication. A polynomial multiplication is a multiplication without carries [13]. It is
used in error correction coding.

Let d and b be two three bit polynomials, where

d = d_2 x^2 + d_1 x^1 + d_0 x^0
b = b_2 x^2 + b_1 x^1 + b_0 x^0

where b_0, b_1, b_2, d_0, d_1, d_2 take values from {0,1}. The PSPM performs
multiplication on two such polynomials. Let d be the multiplier, and b be the multiplicand
polynomial. Let the resultant polynomial be s. Then s is given as

s = s_4 x^4 + s_3 x^3 + s_2 x^2 + s_1 x^1 + s_0 x^0

where s_0, s_1, s_2, s_3, s_4 take on values from {0,1} and are obtained according to
the following Boolean expressions:


s_4 = d_2 AND b_2
s_3 = (d_2 AND b_1) XOR (d_1 AND b_2)
s_2 = (d_2 AND b_0) XOR (d_1 AND b_1) XOR (d_0 AND b_2)
s_1 = (d_1 AND b_0) XOR (d_0 AND b_1)
s_0 = d_0 AND b_0
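
As a check on these expressions, the following sketch compares them with a direct multiplication without carries. The function names are hypothetical and the coefficient lists are written lsb first, which is a convention chosen here for illustration only.

```python
def poly_mul_gf2(d, b):
    """Multiply two polynomials with coefficients in {0,1}, lsb first.

    Partial products are accumulated with XOR, so no carries are produced.
    """
    s = [0] * (len(d) + len(b) - 1)
    for i, d_i in enumerate(d):
        for j, b_j in enumerate(b):
            s[i + j] ^= d_i & b_j
    return s


def product_from_expressions(d, b):
    """Evaluate s_0..s_4 from the Boolean expressions for 3-bit operands."""
    d0, d1, d2 = d
    b0, b1, b2 = b
    return [
        d0 & b0,
        (d1 & b0) ^ (d0 & b1),
        (d2 & b0) ^ (d1 & b1) ^ (d0 & b2),
        (d2 & b1) ^ (d1 & b2),
        d2 & b2,
    ]


# (x + 1) * (x + 1) = x^2 + 1 when no carries are used
example = product_from_expressions([1, 1, 0], [1, 1, 0])
```

Running both functions over all 64 pairs of 3-bit operands gives identical coefficient lists, which is what the expressions for s_0 to s_4 require.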





The description of the PSPM is organized as follows. First the datapath is explained.
Then a brief idea of the controller is given. It is followed by the circuit operation,
along with the description of the controller structure. Finally, an efficient version of
the PSPM is given. This is obtained by modifying the design presented below.

The datapath of a 4-bit Polynomial Serial Parallel Multiplier is shown in
Figure 3.3. It consists of two shift registers, each four bits wide. These shift registers
store the multiplicand and the multiplier respectively. The U gates in the datapath
realise the necessary logic functionality to obtain the resultant polynomial. All the
seven bits of the product are serially available, lsb first, on the output terminals
S. The related control signals for the shift registers are indicated in Figure 3.3;
however, the associated controller is not shown, for the sake of simplicity.

Figure 3.3: PSPM Datapath

One input of the leftmost XOR gate needs to be permanently connected to 0.
The transition corresponding to the 0 is obtained by tying the corresponding rail to
the Data_c signal. Data_c causes both the shift registers to output the stored data
corresponding to the multiplicand and multiplier. This signal therefore initiates the
computation of a single bit of the product polynomial. The end of each computation
is indicated by a transition on the OutAck terminal.

The controller, shown in Figure 3.4a, employs three loops. The execution of the
first loop loads the shift registers with the multiplicand and the multiplier. The
second loop controls the computation of the product polynomial s, and the third
loop clears the stored data from the shift registers so that a new set of data can be
applied.

To control the iteration count of each of the three loops, the controller employs
an Event Counter. The Event Counter counts the number of events taking place on
its Inc input. This is a modulo 15 counter, to take into account all the iterations of
the three loops. The count is available on C1 to C15 in decoded form for each of the
three loops.

These outputs are processed to derive the necessary information for the controller.
For example, the 2-rail signal Count3 shows 0 for C1, C2 and C3 and becomes
1 on C4. It does not respond to any of the other Event Counter outputs. Similarly, the
signal Count11 shows a 0 for outputs C4 to C10 and 1 for C11, not responding to
any other Event Counter outputs.

Figure 3.4b shows the additional circuitry required to generate control signals for
the datapath elements, e.g. the control signals Shmp, Data_c, etc. Inc is another
example of such a control signal. Similarly, acknowledge signals from the datapath
are converted into relevant acknowledge signals for the controller using the circuits
shown in Figure 3.4b. These are discussed later.

Multiplier Operation:

The multiplication operation is initiated in the PSPM with a transition on its
Start terminal. This generates an event on C1 of the Event Counter, which initiates
execution of the first loop in the controller.

In this, the multiplicand is loaded in the multiplicand shift register, lsb first,
such that UReg4 contains the lsb. At the same time, UReg2 to UReg4 of the multiplier
shift register receive 0. The loop is executed thrice, so that three of the four URegs
of both the shift registers receive data. Each loop iteration increments the count
by one. Execution of consecutive iterations is separated by ShiftInAck signals
issued by the shift registers to the controller to ensure hazardless functioning. A
transition on the wire in the controller is used to put the msb of the multiplicand in
UReg1 of the corresponding shift register.

In the first iteration of the second loop, the lsb of the multiplier is stored in UReg1.
At the same time, both the registers output their contents to the U gates implementing
the logic expressions for s. All the bits of the multiplier shift register,
except the lsb, are zero. Hence the least significant bit obtained on terminals S will
be given by (d_0 AND b_0). After receiving the OutAck signal, the multiplier data is
shifted left by one bit. This puts the lsb of the multiplier in UReg2. In the next
iteration, the next significant multiplier bit is applied and the next computation is
initiated. The process repeats seven times, so that all the seven bits of the product
are computed. In the last three iterations, a 0 is shifted into the multiplier shift
register.
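
The seven iterations of the second loop can be mimicked with a purely behavioural sketch. It abstracts away the URegs, the U gates and all handshaking; the function name `pspm_serial` and the lsb-first list convention are assumptions made for illustration, not part of the design above.

```python
def pspm_serial(multiplicand, multiplier):
    """Serial-parallel polynomial multiplication, one product bit per iteration.

    Both operands are lists of bits, lsb first, of equal length n.
    The multiplicand is held in parallel; the multiplier is shifted in
    serially, followed by n - 1 zeros.
    """
    n = len(multiplicand)
    window = [0] * n                           # multiplier shift register, newest bit first
    stream = list(multiplier) + [0] * (n - 1)  # zeros are shifted in at the end
    product = []
    for bit in stream:
        window = [bit] + window[:-1]           # shift the multiplier by one position
        s = 0
        for b_k, d_k in zip(multiplicand, window):
            s ^= b_k & d_k                     # AND terms combined by XOR, no carries
        product.append(s)
    return product


# (1 + x) * (1 + x) = 1 + x^2, produced lsb first over seven iterations
bits = pspm_serial([1, 1, 0, 0], [1, 1, 0, 0])
```

For four-bit operands the loop body runs 2n - 1 = 7 times and the last three iterations shift in a 0, which matches the iteration count given for the second loop.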

The four iterations of the third loop clear the shift registers for the next
multiplication.

It can be seen that the MpShiftInAck signal of the multiplier shift register is
generated for every iteration in the first two loops and once in the third loop. As this
signal is available on a single wire, it needs to be decoded into two different
acknowledge signals for the first and second loops, while the occurrence of this signal in the
third loop should be accounted for separately. This is achieved by the two Select
blocks and an XOR gate shown in Figure 3.4b. The signal MpShiftInAck is
produced on MpShiftInAck1 in response to a transition on wire w2 of the first loop
in the controller. It is produced on MpShiftInAck2 in response to a transition on
tuy, while a transition on nig produces it on terminals NC, which are not connected
to any block. A similar arrangement exists for MlShiftInAck and MpShiftOutAck.

The total execution time for the PSPM can be reduced by assuming that both
the shift registers contain some arbitrary data on global reset. The loading of new
data for each multiplication operation can then be performed while simultaneously
removing the stored data corresponding to the previous multiplication operation.
Minor modifications in the datapath involve using an inverter in any one of the 2-rail
inputs for all the URegs used in the shift registers of Figure 3.3. Also, the inverters
connected to the b input of all the C elements C1 to C3 are removed. Inverters are
added to the a input of the C elements C2 and C3. These inverters in the shift
registers ensure initial data after a global reset. The modification in the controller
of the shift register reflects the modified status of the shift register.

The modified controller for the PSPM is as shown in Figure 3.5 and is self
explanatory.

3.3 Modulo 8 Initializable Up/Down Counter 

As the name suggests, the objective of this circuit is to count both in the up and
in the down direction. The counter is also initializable using the Clear and Init
signals. Besides the double rail input terminals Init0, Init1 and Init2, it has Up_count
and Down_count control input terminals which set the mode of counting. A single
rail Clear terminal is meant for removing all the transitions corresponding to the
current count present in the Counter. The completion of this operation is signified
by an event on ClrAck. After a global reset or a Clear, initialization inputs can be
applied on the Init inputs. This can be followed by an event on either the Up_count or
the Down_count input.

The Counter is designed in exactly the same way as its synchronous counterpart. It is then
implemented with the A_CTff and A_Tff and the necessary U gates to realise the Next
State equations as listed in Figure 3.6. Figure 3.6 also shows the implementation of
the Counter. The Up_count and Down_count control signal wires are disjunctively
combined to derive the Nextstate control input for all the flipflops. The ClrAck
signals of all the flipflops are used to derive the ClrAck of the Counter.

The circuit block B receives as its inputs the Cascade outputs of both the A_CTff
and the A_Tff. It also has the Up_count and the Down_count control inputs. In the
down count mode, it inverts the inputs, while in the up count mode, the inputs reach
the output terminals uncomplemented. This block is very easily designed using a U
gate and two XOR gates.

Every time the flipflops change state, after receiving the Nextstate input, the
present state is available on the corresponding Cascade terminals. These state bit
values are then processed to derive the Next State T input for each of the flipflops.
This is done using the B block and the AND gate implemented using a U gate.

Figure 3.4b: PSPM Controller

Figure 3.5: PSPM Controller (Improved Version)
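
The role of the B block and the AND gate can be seen in a state-level sketch of a modulo 8 up/down counter built from T flipflops. The Next State equations used here are the textbook ones for a synchronous T flipflop counter and are an assumption, since the equations of Figure 3.6 are not reproduced in the text; the function names are hypothetical and no handshaking is modelled.

```python
def next_state(q, down):
    """One counting step of a mod-8 counter made of three T flipflops.

    q is (q0, q1, q2) with q0 the least significant bit.
    down is 1 for the down count mode and 0 for the up count mode.
    """
    q0, q1, q2 = q
    # Block B: pass the state bits unchanged when counting up,
    # complement them when counting down (an XOR with the mode).
    b0, b1 = q0 ^ down, q1 ^ down
    # T inputs: the lsb always toggles, higher bits toggle on a ripple condition.
    t0, t1, t2 = 1, b0, b0 & b1
    return (q0 ^ t0, q1 ^ t1, q2 ^ t2)


def value(q):
    """Integer value of the state (q0, q1, q2)."""
    return q[0] + 2 * q[1] + 4 * q[2]


up_sequence = []
state = (0, 0, 0)
for _ in range(8):
    state = next_state(state, down=0)
    up_sequence.append(value(state))
```

In this sketch the only difference between the two modes is the complementing done by block B, which is why a single AND gate suffices for the most significant flipflop.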
Chapter 4


Interconnections 


4.1 Introduction 

Different resources like registers, datapath elements and input and output ports
need to be interconnected to realise any digital system. The interconnections can
be carried out in several ways, out of which the point to point interconnection
and the bus topology have been widely used in synchronous systems. In this
chapter, we explore the possibility of employing similar interconnection structures
for realising digital systems using asynchronous elements. In the point to point
interconnection scheme, there is a direct connection from the output of one resource to
the input of another resource which needs it. However, due to resource sharing, it
becomes necessary, at different instants of time, to connect the output of one resource out
of several to the input of another resource. In the bus topology, the
output of each resource is transferred to a common resource shared by all of them,
usually the bus. The output is then routed to the desired destination.


4.2 Point to Point Interconnections 

Assume that the data in a register is to be transferred to only a single destination.
Then a direct connection is made between the two. However, if the data is to be
transferred to more than one resource, then a direct connection is not possible. To
realise this task, the output of the source register needs to be routed to the appropriate
destination. This can be realised using an A_Demux. The functional representation of
the A_Demux is as shown in Figure 4.1a. In is the two rail input. C_1 to C_n are the n
single rail control signals. The A_Demux has n 2-rail outputs O_1 to O_n. The application
of one of the control signals C_i transfers the input data to the corresponding output
O_i. The circuit realisation for n = 4 is as shown in Figure 4.1a.


Figure 4.1a: A_Demux


An application of Data_out and one of the control signals C_i to the A_Demux
will transfer the data in REG1 to O_i as shown in Figure 4.1b. It can be seen that the
realisation for n = 1 is a Select block.
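
The routing performed by the A_Demux can be pictured with a level-based sketch of 2-phase transition signalling, in which an event on a wire is a change of its logic level. The class `ADemux`, its 0-based output indices and the method `transfer` are hypothetical and model only the routing, not the C element and XOR gate realisation of Figure 4.1a.

```python
class ADemux:
    """Routing model of an n-output demultiplexer for 2-rail NRZ data."""

    def __init__(self, n):
        # One [rail0, rail1] pair of wire levels per 2-rail output.
        self.levels = [[0, 0] for _ in range(n)]

    def transfer(self, bit, i):
        """Route one data bit to output i.

        The bit arrives as a transition on input rail `bit`; the control
        event C_i selects output i, which receives a transition on the
        same rail.
        """
        self.levels[i][bit] ^= 1  # 2-phase signalling: an event toggles the level
        return i, bit


demux = ADemux(4)
demux.transfer(1, 2)  # send a 1 to the third output
after_first = [list(pair) for pair in demux.levels]
demux.transfer(1, 2)  # a second 1 on the same output is another toggle
after_second = [list(pair) for pair in demux.levels]
```

Because the signalling is non-return-to-zero, the second event leaves the wire at its original level even though two data bits have been delivered.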


4.3 Bus Structures 

The bus interconnection scheme can be studied in the following general setting.
Assume the following resources are given:

1. A register file of n registers, each m bits wide

2. p_i input ports and p_o output ports, each m bits wide

3. An m bit wide bus, which is nothing but m 2-rail wires, i.e. 2m wires

Figure 4.1b: Data transfer using A_Demux

The bus interconnection should allow the following classes of data transfer for
the resources given above:

1. Register to register

2. Register to output port

3. Input port to register

In the descriptions that follow, the register to register data transfer will be
indicated by R_i → R_j, where i indicates the source and j the destination. The
register to port data transfer is indicated by R_i → P_j and the remaining data
transfer is indicated by P_i → R_j.

In the design methodology adopted, any register communicates data with the
datapath elements in the same way as it communicates with the ports. Hence,
datapath elements receive data through R_i → P_j and their output is routed to
registers through P_i → R_j.

The typical bus structure is as shown in Figure 4.2. It consists of n registers,
p_o output ports and p_i input ports connected together using the bus. The data
transfers between them are coordinated by the local bus controller. The local bus
controller receives the control signals from the main controller CONI. For each data
transfer carried out on the bus, an event on FinalAck is generated, signifying
the completion of the data transfer. This signal is sent to CONI.

Figure 4.2: Block diagram of a bus structure

We assume that the main controller CONI provides the following control signals
to the local bus controller:

1. For the n registers:

(a) n source control wires, WS_1 to WS_n

(b) n destination control wires, WD_1 to WD_n

2. p_o register to port transfer control signals, WP_1 to WP_{p_o}

3. A port to register transfer control signal, P2r_trf

Furthermore, it is assumed that the following events are available to the bus
structure for the different modes of data transfer:

R_i → R_j: A transition each on WS_i and WD_j

R_i → P_j: A transition each on WS_i and WP_j

P_i → R_j: A transition on WD_j and on P2r_trf

All the above assumptions are made in order to isolate the design of the bus
structure from the details of the main controller CONI.
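
The assumed interface to the main controller can be summarised as a lookup from a transfer mode to the control wires that receive a transition. The function name `control_events` and the string encoding of the modes and wire names are hypothetical and serve only as an illustration of the three cases listed above.

```python
def control_events(mode, src, dst):
    """Return the set of control wires that receive a transition.

    mode is one of "R->R", "R->P" or "P->R"; src and dst are the
    indices i and j of the source and destination resources.
    """
    if mode == "R->R":
        return {f"WS_{src}", f"WD_{dst}"}
    if mode == "R->P":
        return {f"WS_{src}", f"WP_{dst}"}
    if mode == "P->R":
        # The input port needs no source wire; one shared wire marks the mode.
        return {f"WD_{dst}", "P2r_trf"}
    raise ValueError(f"unknown transfer mode: {mode}")


register_to_register = control_events("R->R", 1, 3)
port_to_register = control_events("P->R", 2, 3)
```

Every mode involves exactly two events, so the local bus controller can be designed around pairs of transitions without knowing how CONI produces them.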

4.3.1 Basic Idea 

All the three approaches to be discussed use the same underlying basic idea, as given
below. Two distinct phases constitute a typical data transfer: Data clearance and
Data transfer to the destination.

Data clearance: This phase is carried out only when the destination is a register.
In this, the data in R_j is cleared and a ClrAck is generated to signify the
completion of the operation.

Data transfer: The bus receives data from all the input ports and from the outputs
of all the registers. All these 2-rail signals corresponding to a bit position in
an m bit wide bus are merged into a single 2-rail signal through a Merger
element. The output of the Merger could either be transferred directly to the
bus or transferred after necessary processing. The bus transfers data to all
the registers and output ports. The signal on the bus is routed to the desired
destination using a 2-rail Data Decoder. For every data transfer, a completion
signal called FinalAck is generated by the bus. This serves as an acknowledge
signal to the main controller CONI.

R_i → R_j: This data transfer involves clearing the previous data stored in R_j. This
is then followed by the transfer of data from R_i to R_j.

P_i → R_j: In this, we assume that the data is already available at the input port.
The destination register is first cleared as above and then the data present in
port P_i is routed to R_j.

R_i → P_j: In this, the data in R_i is transferred to the bus and is then routed to the
destination port P_j.

The rest of the chapter describes three approaches to realise the bus structure.
Each approach is described as follows. First the datapath and the local bus controller
are described. The operation in the three modes of data transfer is illustrated.
The circuit complexity in terms of the number of C elements and XOR gates and the
delay in implementing a data transfer R_i → R_j is given. Finally, the comparison of
the three approaches in terms of the circuit complexity and delay is made. The delay
constraints for the Data Decoder are specified and an alternative implementation
of the Basic Data Decoder module is given, which can be used in case the delay
constraint cannot be satisfied easily. We also present the implementation of a
P_i → P_j data transfer that is needed for the transfer of data directly between datapath
elements.

All the bus structures are implemented for 1 bit. For m bits, m instances of the
datapath should be used.

4.3.2 Approach 1 

The datapath needed for this approach is as shown in Figure 4.3a. The datapath
consists of a 2-rail merger, a Data Decoder, n registers and a 1-rail merger to generate
the FinalAck signal.

The inputs to the 2-rail merger are the outputs of the registers and the input
ports. These inputs are combined such that a transition on any one of them is
routed to the output. Only one input gets routed to the output at any instant. The
merger is realised as shown in Figure 4.3b. The output of this merger is connected
to the bus. The input of the Data Decoder is also connected to the bus. The
Data Decoder receives n Reg_out and p_o Port_out control signals from the local bus
controller. It has (n + p_o) 2-rail outputs. Depending on a transition on one of the
Reg_out or Port_out control signals, the input data is routed to the appropriate port
or register. Each time an input data is transferred by the Data Decoder, an
acknowledge is generated on one of its (n + p_o) DdrAck and PortDdrAck outputs.
All these acknowledges are disjunctively combined to derive the FinalAck signal by
a 1-rail merger. The register used in this approach is REG1, described earlier in
Chapter 2.

Figure 4.3a: Bus Approach 1: Datapath


The Basic Data Decoder module, instances of which are used to realise the Data
Decoder, is shown in Figure 4.3c. It has a 2-rail input In and a control signal,
Control. It also receives feedback from all the other Basic Data Decoder modules used in
the realisation of the Data Decoder. It outputs on the Out terminals. An XOR
gate connected to the Out terminals generates a DdrAck signal, which signifies the
completion of the data transfer from the input of the Basic Data Decoder module to its
output. The C elements C1 and C2 and the XOR gate connected to their y inputs constitute
a Select block. Every time a data bit is delivered by the bus, it is received by the
x input of either C1 or C2. It can be inferred from the functionality of the Select
module that a transition on the Control input terminal will produce the transition
on the Out terminals. This transition corresponds to the transition representing the
input data. An Ack signal signifies the completion of the transfer of the input data
to the output terminals. Only one of the (n + p_o) Basic Data Decoder modules in
the Data Decoder block transfers the input data to its output terminals. However,
the remaining Basic Data Decoder modules receive the same input transition. These
extraneous transitions need to be cancelled. The transition cancellation is achieved
by feeding back the output transition as an input to the remaining Basic Data Decoder
modules.

Figure 4.3b: 1-rail and 2-rail merger for 5 inputs
For a 2-rail merger combining n signals:
Number of XOR gates = 2(n - 1)
Delay = log_2(n) * XOR gate delay
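
The gate count and delay quoted for the merger can be reproduced with a short calculation, and the XOR-based merging itself can be modelled on wire levels. This is a sketch under stated assumptions: the function names are hypothetical, the merger is taken to be a balanced tree of 2-input XOR gates, and the depth is rounded up to a whole number of gate levels when n is not a power of two.

```python
from functools import reduce
from math import ceil, log2
from operator import xor


def merger_cost(n, rails=2):
    """Return (number of 2-input XOR gates, tree depth in gate delays).

    Each rail needs n - 1 two-input XOR gates to combine n signals.
    """
    gates = rails * (n - 1)
    depth = ceil(log2(n)) if n > 1 else 0
    return gates, depth


def merge(levels):
    """Output level of an XOR merger for the given input wire levels.

    A level change on exactly one input changes the output level, so a
    single input transition is passed on as an output transition.
    """
    return reduce(xor, levels, 0)


two_rail_five_inputs = merger_cost(5)
one_rail_five_inputs = merger_cost(5, rails=1)
single_event = merge([0, 1, 0, 0, 0])
cancelled_pair = merge([0, 1, 0, 1, 0])
```

The last value shows the property used for transition cancellation: two transitions reaching the same XOR network leave its output level unchanged.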

The local bus controller is shown in Figure 4.4. It consists of n C elements C_1
to C_n. One input of C_i is connected to the destination control wire WD_i, which
is also connected to the Clr input of the register R_i. The other input receives the
ClrAck output of R_i. The output of C_i is connected to the Reg_out_i control signal
of the Data Decoder. The other inputs to the local bus controller, from the main
controller CONI, are also connected to the different modules in the datapath block
as shown in the figure.


Circuit Operation

• R_i → R_j: In this data transfer, the transition on WS_i is applied to the Data_out
control signal of R_i and that on WD_j is applied to the Clr of R_j. This activates
two concurrent processes. In one of them, the data stored in R_i is placed on
its R_out terminals. It therefore gets transferred to the Data Decoder. In
the other process, the data in R_j is cleared, and ClrAck_j is generated. This
is applied to C_j, whose other input receives a transition from WD_j. The
corresponding output transition on Reg_out_j routes the data transferred to the Data
Decoder by R_i to R_j.

• R_i → P_j: The transition on WS_i is applied to Data_out of R_i and that
on WP_j is applied to the Port_out control signal of the Data Decoder. This
transfers the data stored in R_i to P_j.

• P_i → R_j: The data available on P_i gets transferred to the Data Decoder. A
transition on WD_j clears R_j. Subsequently, the Reg_out control signal for the
Data Decoder is generated. This routes the data in the Data Decoder to R_j.

Delay Constraint

The delay in the (n + p_o - 1) input XOR gate at each input of every Basic Data Decoder
module in the Data Decoder is log_2(n + p_o - 1) times the delay in an XOR gate.
It can be seen that this delay is equal to the delay in the 1-rail merger used to
generate the FinalAck signal. For the block consisting of the Data Decoder module
and the 1-rail merger, the XOR gate delays are assumed to be bounded. Therefore,
this block can be designed such that, by the time the FinalAck is generated, all the
extraneous transitions in the Data Decoder are cancelled.

Figure 4.4: Bus Approach 1: Local bus controller


C Elements:  m(7n + 2p_o + 1) - 1
XOR Gates:   m[2n^2 + 2p_o^2 + 13n + 2p_i + 2p_o - 3]

Table 4.1: Bus Approach 1: Circuit complexity


Complexity

The circuit complexity is expressed in terms of the number of C elements and XOR
gates needed to implement the bus structure. It is calculated using the individual
circuit complexities of the sub-blocks used. The circuit complexity for this approach
is as given in Table 4.1.

The delays involved in a data transfer are described using a directed acyclic
graph {V, E}, where V is a set of vertices and E is a set of edges. Each vertex
represents the completion of an operation in the data transfer. The temporal
correlation between successive events i and j is represented by a directed edge from V_i
to V_j. The delay associated with an edge can be expressed in terms of the number
of C elements and XOR gates used in generating the successor event. We separate
the contribution of the two elements by labelling an edge as d1/d2, where d1 is the
total delay due to the C elements and d2 is that due to the XOR gates. The delays
involved in R_i → R_j for this approach are as shown in Figure 4.5.

Figure 4.5: Bus Approach 1: Delay in R_i → R_j
Edges (d1/d2):
Data_out issued → Data in the Data Decoder: 1 / 1 + lg(n + p_i)
Clr issued → ClrAck generated: lg(m) / 1
ClrAck generated → Control signal to the Data Decoder: 1 / -
Control signal to the Data Decoder → Data to the destination: 1 / -
Data to the destination → FinalAck issued: lg(m) / 1 + lg(n + p_o)
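
The d1/d2 labelling lends itself to a simple calculation of when each event completes. The sketch below assumes that an event occurs only after all of its predecessor events have occurred and their edge delays have elapsed. The function name, the event names and every number in the example graph are hypothetical placeholders chosen for illustration; they are not the delays of any figure in this chapter.

```python
from graphlib import TopologicalSorter


def completion_times(edges, t_c, t_xor):
    """Earliest completion time of every event in a delay graph.

    edges maps (u, v) to (d1, d2): d1 C element delays and d2 XOR gate
    delays on the edge from event u to event v. t_c and t_xor are the
    delays of one C element and one XOR gate.
    """
    preds = {}
    for (u, v), (d1, d2) in edges.items():
        preds.setdefault(u, {})
        preds.setdefault(v, {})[u] = d1 * t_c + d2 * t_xor
    # TopologicalSorter takes a mapping from each node to its predecessors.
    order = TopologicalSorter({v: set(p) for v, p in preds.items()}).static_order()
    time = {}
    for v in order:
        # An event waits for the slowest of its incoming paths.
        time[v] = max((time[u] + w for u, w in preds[v].items()), default=0)
    return time


# Hypothetical graph with made-up edge labels (d1, d2).
example_edges = {
    ("data_out", "data_in_decoder"): (1, 3),
    ("clr", "clr_ack"): (2, 1),
    ("clr_ack", "control"): (1, 0),
    ("control", "destination"): (1, 0),
    ("data_in_decoder", "destination"): (1, 0),
    ("destination", "final_ack"): (2, 4),
}
times = completion_times(example_edges, t_c=2, t_xor=1)
```

With these placeholder values the clearing path is the slower of the two concurrent paths, so it determines when the data reaches the destination.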

The major drawback of this approach is the higher circuit complexity of the
Data Decoder. The number of XOR gates required grows quadratically with the number
of registers and ports. It
can be reduced with an alternative arrangement to derive the feedback for each of the
Basic Data Decoder modules, as shown in Figure 4.6. Here, the feedback for each of the
(n + p_o) Basic Data Decoder modules is derived using a common network of XOR
gates. The Basic Data Decoder module uses a 2-input XOR gate instead of an (n + p_o)
input XOR gate. The input signal f_i(1) in the Basic Data Decoder module i receives
a transition when any of the Basic Data Decoder modules transfers the data bit 1.
The upper limit on the complexity of the feedback processing circuit is N log_2 N, where
N = n + p_o. The modified circuit complexity and the delay in R_i → R_j are shown
in Table 4.2 and Figure 4.7.

In the next approach, a way to reduce the circuit complexity further is shown.



Figure 4.6: Bus Approach 1: Modified decoder


4.3.3 Approach 2 

The cause of the higher circuit complexity of the Data Decoder of approach 1 is the
large number of XOR gates used to cancel the extraneous transitions in the Data
Decoder. In this approach, the circuit complexity of the Data Decoder is reduced
by using a 2 input XOR gate instead of an (n + p_o) input XOR gate in the Basic
Data Decoder module.

In a single data transfer, the Data Decoder used in this approach receives, at
its input, the data to be transferred, twice. The first occurrence of the data is used
to route it to the appropriate destination and the second occurrence of the data is
used to cancel the extraneous transitions in it. The application of the data to the
decoder twice in a single data transfer is made possible by the use of a Pass and
Store circuit.

C Elements:  m(7n + 2p_o + 1) - 1
XOR Gates:   … + 2p_i - 2] + [2(n + p_o)(1 + log_2(n + p_o))] - 1

Table 4.2: Bus Approach 1: Circuit complexity with the modified decoder

Figure 4.7: Bus Approach 1: Delay in R_i → R_j for the modified decoder
Edges (d1/d2):
Data_out issued → Data in the Data Decoder: 1 / 2 + lg(n + p_i)
Clr issued → ClrAck generated: 1 + lg(m) / 1
ClrAck generated → Control signal to the Data Decoder: 1 / -
Control signal to the Data Decoder → Data to the destination: 1 / -
Data to the destination → FinalAck issued: lg(m) / 1 + lg(n + p_o)

The datapath to realise this approach is as shown in Figure 4.8a. Compared to
approach 1, here the 1-rail merger is replaced by a Final Ack Generator.

The Pass and Store block receives the output of a 2-rail merger. It has two
control signals, Pass_cnt and Pass_store_cnt, and a 2-rail output which is connected
to the bus. The Pass_store_cnt control signal places the input data on the output
terminals. In the process, it also retains the data. The Pass_cnt signal, which is
subsequently applied, is used to shift the stored data to its output, after which no
data exists in the circuit block. This circuit is very easily realised using a UReg and
2 XOR gates as shown in Figure 4.8b.

Figure 4.8a: Bus Approach 2: Datapath

The Basic Data Decoder module is shown in Figure 4.8c. The C elements C1 and C2,
together with the XOR gates connected to their y inputs, constitute a
generalized C element. In every data transfer, the first application of data at
one of the inputs is transferred to the corresponding C element in the
generalized C element. A transition on the Control input then transfers the data
bit to the OUT terminals. The feedback from the output of the C element through
the XOR gate also places the transition on OUT back into the C element. Since
the 2-rail bus is connected as an input to all the Basic Data Decoder modules,
the transition corresponding to the data is consumed only by the Basic Data
Decoder module Bi which receives the Control signal, whereas it is retained by
the remaining modules. The second application of the same data therefore cancels
these retained transitions, as well as the output transition fed back to the C
element in Bi.
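The retention and cancellation of transitions described above follow from the
behaviour of a C element under 2-phase transition signalling. The following is a
minimal Python sketch, assuming the textbook Muller C element (the output changes
only when both inputs agree) and leaving out the XOR feedback path. The class and
method names are illustrative and do not come from the thesis.

```python
class CElement:
    """Muller C element: the output follows the inputs only when they agree."""

    def __init__(self):
        self.x = 0
        self.y = 0
        self.out = 0

    def _update(self):
        # When the inputs differ, the previous output is held.
        if self.x == self.y:
            self.out = self.x
        return self.out

    def toggle_x(self):
        """Apply one transition (event) on the x input."""
        self.x ^= 1
        return self._update()

    def toggle_y(self):
        """Apply one transition (event) on the y input."""
        self.y ^= 1
        return self._update()


# Module that receives Control: the data event is stored, Control releases it.
selected = CElement()
selected.toggle_x()            # first application of data: no output event yet
fired = selected.toggle_y()    # Control arrives: the output makes a transition

# Module without Control: the data event is retained, then cancelled.
idle = CElement()
idle.toggle_x()                # first application of data: retained on x
idle.toggle_x()                # second application of the same data: cancelled
```

After the second data event the idle module is back in its initial state without
ever having produced an output event, which is the property the Data Decoder
relies on.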

The Final Ack Generator receives data from the bus at its input and produces an
event on FinalAck, signifying the completion of a single data transfer
operation. It can be easily seen that FinalAck is generated after the data is
transferred to the destination. Two realisations of this circuit are shown in
Figure 4.8d. In the first realisation, the x input of the C element initially
has a transition due to the global reset. The first application of the new data
cancels this transition. Next, a transition on the y input due to Pass_cnt
creates an output transition after the second application of the new data. This
generates the FinalAck. It also returns the Final Ack Generator to the state it
was in before the application of new data. The second realisation is based on
the Toggle element and is the delay insensitive version of the first.

Figure 4.8b Bus Approach 2: Pass and Store Circuit

Figure 4.8c Bus Approach 2: Basic Data Decoder Module



In the present approach, the data transfer from the source to the destination in
all three modes is very similar to that in the first approach. However, due to
changes in the Data Decoder and the addition of a Pass and Store circuit, there
are some differences in the actual implementation. There are two implementations
of the controller for this approach. These arise from the way in which the
control signals Pass_store_cnt and Pass_cnt are generated.

Local Bus Controller 1

In this controller, the control signals Pass_cnt and Pass_store_cnt are
generated locally for each of the m bits, as shown in Figure 4.9. The signal
Pass_store_cnt is generated by disjunctively combining the ClrAck signals from
the registers and the WP1 to WPpo wires, while Pass_cnt is generated by
disjunctively combining all the DdrAck and PortDdrAck signals of the Data
Decoder. Also, unlike the local bus controller of approach 1, it does not use
the n C elements C1 to Cn. The controller of approach 1 uses them to ensure
that, in a data transfer to Rj, the data is not transferred to it unless the
previous data in it has been cleared. In this controller, the same restriction
is satisfied by generating Pass_store_cnt only after the ClrAck of Rj is
generated.

Figure 4.9 Bus Approach 2: Local Bus Controller 1


Local Bus Controller 2

Instead of generating Pass_cnt and Pass_store_cnt locally for each of the m bits
using the ClrAck and the Data Decoder acknowledge signals, these acknowledge
signals are passed to the local bus controller. They are processed by the local
bus controller to generate the Pass_cnt and Pass_store_cnt signals, as shown in
Figure 4.10.


Complexity

The circuit complexity of the datapath and both the controllers is shown in
Table 4.3. The delays in Ri → Rj for both controller realisations are
represented in the graphs of Figure 4.11 and Figure 4.12.
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XOR Gates 
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Local bus controller 1 
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m[Qn + 2po + 5] 
Table 4.3 Bus approach 2: Circuit complexity
lg = log2

• Data_out issued
    1 / 1 + lg(n+pi)
• Data in the Data Decoder
    1 / 1
• First data to bus
    1 / 1
• Clr issued
    1 / lg(n)
• Pass_store_cnt issued
• Data to the destination
    - / lg(n+po)
• Pass_cnt issued
    1 / 1
• Second data to bus
    1 + lg(m) / 2
• FinalAck issued

Figure 4.11 Bus Approach 2: Delay in Ri → Rj for local bus controller 1
• Clr issued
    2 + lg(m) / 1 + lg(n)
• Pass_store_cnt issued
• Data_out issued
    1 / 1 + lg(n+pi)
• Data in the Data Decoder
    1 / 1
• First data to bus
• Data to the destination
    1 + lg(m) / 1 + lg(n+po)
• Pass_cnt issued
• Second data to bus
    1 + lg(m) / 2
• FinalAck issued

Figure 4.12 Bus Approach 2: Delay in Ri → Rj for local bus controller 2



4.3.4 Approach 3 

The circuit complexity is further reduced by using a different register
structure instead of REG1. As shown in Figure 4.13a, this register has a 2-rail
input, In, a 2-rail output, R_out, and a control signal, Data_out. The stored
data is available on the R_out terminals when Data_out is active. However, in
this process, the stored data is lost. On global reset, the register is
initialized to a value of 0 because of the inverter. As can be seen, the
register does not have separate mechanisms for clearing the stored data and for
putting it on the output terminals. Therefore, for a register acting as a
source, the same data must be fed back from the output terminals if it is to
remain stored. This is achieved by routing the data bit back to the source
register through the Data Decoder while it is being routed to the destination
register. To clear the destination register, the stored data is routed to
output terminals which are not connected to the bus.

The datapath is shown in Figure 4.13b. The Pass and Store circuit, the Data
Decoder and the Final Ack Generator circuits remain the same as in approach 2.
In addition, it uses three 2-rail mergers, A, B and C, a Select block and a U
gate. The U gate receives the output of merger C. It also has two control
signals, Clr_cnt and Route_cnt, and 2-rail outputs Clr_out and Route_out. An
event on Clr_cnt transfers the data at its input to Clr_out, which is used to
generate a ClrAck signal for the controller. The data to be cleared is routed to
these terminals. The ClrAck signal is used to derive Route_cnt for the U gate
and Pass_store_cnt for the Pass and Store circuit, as shown. An event on
Route_cnt passes the input to the Pass and Store circuit. The 2-rail merger A
combines the outputs of all the registers, while B combines all the input ports.
The merger C combines the outputs of A and the Select block.

Figure 4.13b Bus Approach 3: Datapath

As can be seen, the data bit in Rj which is to be cleared and the useful data
bit in Ri which is to be routed to Rj are both applied to the same input of the
U gate. Therefore, the following condition needs to be satisfied to avoid
possible hazards: the data bit to be routed to Rj is not applied to the U gate
until the data in Rj has been cleared by the U gate. This restriction is
satisfied for the different modes of data transfer as described below.

For Pi → Rj, the bus structure does not have control over the time instant at
which new data arrives at the input port. Therefore, to satisfy the above
condition, the new data arriving at the input ports is not applied to the input
of the U gate until the stored data in Rj has been cleared. This is achieved by
the Select block in Figure 4.13b. It receives the data from the 2-rail merger B
and passes it to the U gate only after its Take_port_in input is activated by
the controller. The local bus controller generates this event on Take_port_in
only after Rj is cleared. For Ri → Rj, Rj is first cleared. The ClrAck generated
in this process is fed to the controller. The controller, on receiving it,
allows Ri to transfer its data by issuing a Data_out signal for Ri. This task is
realised by the Control Decoder block in the controller. For Ri → Pj, the
restriction need not be satisfied.

Local Bus Controller

The controller, shown in Figure 4.14a, takes the usual inputs as described for
the earlier approaches. In addition to these, we assume that the control signals
R2r_trf and R2p_trf are available. Signal R2r_trf is active for any Ri → Rj,
while an active R2p_trf signifies a Ri → Pj. These signals are assumed to come
from the main controller CONI. The Control Decoder consists of the Basic Control
Decoder modules shown in Figure 4.14b. The principle of operation is the same as
that of the Basic Data Decoder module, except for its 1-rail input signal. The
source control wires WS1 to WSn are applied to the Control Decoder. It also
receives a control signal Cd_control. It has n outputs, WS1' to WSn'. For any
data transfer having a register as a source, two transitions are received by the
Cd_control wire of the Control Decoder, and these are applied to all the Basic
Control Decoder modules. The first transition transfers any event that is
present at one of the input terminals WSi to the respective output terminal
WSi'. The second transition on Cd_control cancels the extraneous transitions at
the input of the corresponding C element. For those modules which do not have
any input event, the second transition cancels out the first transition. The
outputs of the Control Decoder are used to generate the corresponding Data_out
signals for the source registers.

Figure 4.14b Bus Approach 3: Basic Control Decoder Module

The controller also contains n Select blocks. Each Select block, Seli, has
DdrAcki from the Data Decoder as one of its inputs. WSi', generated within the
controller, and WDi are its other inputs. It has two single-rail outputs,
Pass_sourcei and Pass_desti. A transition on DdrAcki transfers an event on one
of the input terminals WSi' or WDi to the Pass_sourcei or Pass_desti output,
respectively. All the n Pass_source and n Pass_dest signals are used to generate
the Pass_cnt signal as shown in Figure 4.14a. An additional Select block in the
controller converts the ClrAck generated by the datapath to the signal
Take_port_in for port to register data transfers only, as shown in the figure.
For register to register data transfers, the ClrAck is converted to ClrAck2.

The functioning of the above Select blocks and the circuitry associated with the
register to port data transfer will be discussed while describing the actual
circuit operation. The controller also generates the control signals needed by
the datapath elements, e.g., Data_out for registers, Clr_cnt for the U gate and
the Reg_out control signals for the Data Decoder.

Circuit Operation

Ri → Rj: In this, the main controller provides transitions on WSi and WDj. The
event on WDj clears the contents of Rj. An event on ClrAck signifies the
completion of this. The Select block in the controller transfers this event to
ClrAck2. This event, in turn, is used to produce a transition on Cd_control. The
Control Decoder then transfers a transition on one of its inputs, WSi, to the
corresponding output WSi'. This event, in turn, produces a transition on
Data_out of the corresponding source register Ri. This causes the stored data of
Ri to be transferred to the U gate. This data is transferred to the bus through
the U gate and the Pass and Store circuit, as events on both the control
signals, Route_cnt of the U gate and Pass_store_cnt of the Pass and Store
circuit, are already present as a result of the transition on ClrAck.

The Data Decoder, which already has events on the Reg_out control signals of Ri
and Rj, enables the data to be transferred to both Ri and Rj. Thus Rj gets new
data while Ri retains its original data. The Data Decoder produces events on its
two DdrAck signals corresponding to the data being written to the two registers.
These signals are processed by the two Select blocks in the controller
associated with Ri and Rj to generate Pass_sourcei and Pass_destj. These signals
are received by the x and y inputs of the C element Cp, respectively, as shown
in the figure, generating a Pass_cnt signal. Pass_sourcei also generates a
second event on the Cd_control input of the Control Decoder, cancelling the
extraneous transitions in it. The Pass_cnt signal initiates the remaining set of
events, which eventually results in an event on FinalAck to signal the
completion of the data transfer operation.

Ri → Pj: For this data transfer, the temporal restriction involving data removal
from the destination resource is not necessary. The control signals from the
main controller drive the wires WSi and WPj of the local bus controller. WSi is
applied to the corresponding Reg_out control signal of the Data Decoder, while
WPj is applied to the corresponding Port_out control signal. R2p_trf feeds the
Route_cnt input of the U gate and Pass_store_cnt of the Pass and Store block. It
also generates an event on the Cd_control of the Control Decoder, which in turn
is responsible for transferring the stored data in Ri to the bus. As all the
necessary control signals are available, the data is passed to the port as well
as to Ri.

The Data Decoder generates two acknowledges, one on PortDdrAckj assigned to Pj
and the other on DdrAcki assigned to Ri. The latter is processed in the usual
way to deposit a transition on the x input of Cp in Figure 4.14a. It also
creates the additional transition on the Cd_control input of the Control
Decoder. The PortDdrAckj, on the other hand, is received by one input of the C
element CPi, the other input of which already has a transition arising out of
WPj. This creates an event on the output of CPi. This event, in turn, deposits
an event on the y input of the C element Cp. Therefore a Pass_cnt is generated.
The next set of events is as described before.

Pi → Rj: In this, the event on WDj generates an event on the Data_out of Rj. The
same event is deposited as the appropriate Reg_out control signal event in the
Data Decoder. An event on the port to register transfer control signal P2r_trf
is used to create the Clr_cnt input for the U gate in the datapath. This results
in clearing Rj. The ClrAck generated in this process, in turn, generates an
event on the Take_port_in signal in the Select block of the local bus
controller. An event on Take_port_in deposits a transition on the x input of the
C element Cp. The same event allows the data at the input port to be transferred
to Rj. The Data Decoder produces only DdrAckj here. DdrAckj is processed by the
corresponding Select block and generates an event on the Pass_destj wire. This,
in turn, deposits an event on the y input of Cp, generating the Pass_cnt signal.
The rest is the same as before.
lg = log2

• R2r_trf issued
    2 / 5 + lg(n)
• ClrAck generated
    5 / 8 + lg(n)
• First data to bus
    1 / 1
• Data to the destination
    lg(m) / 1
• Ddr_ack generated
    2 / 2 + lg(n)
• Pass_cnt issued
    2 + lg(m) / 3
• FinalAck issued

Figure 4.15 Bus Approach 3: Delay in Ri → Rj
Module                  C Elements                 XOR Gates
Datapath                m[4n + 2po + 11]           m[9n + 2pi + 5po + 21]
Local bus controller    (n + 1)(7n + 2) + mpo      7n + po - 3]

Table 4.4 Bus approach 3: Circuit complexity


Complexity

The circuit complexity in terms of C elements and XOR gates is given in
Table 4.4. The delays involved in Ri → Rj are shown in Figure 4.15.

4.3.5 Delay Constraints for Approach 2 and 3 

By the time FinalAck is generated, all the extraneous transitions in the Data
Decoder must be cancelled. Let the Data Decoder and the Final Ack Generator
constitute a block. Then the following delay constraint needs to be satisfied:
the delay in the XOR gates used to generate FinalAck should be greater than the
delays in those XOR gates of the Data Decoder which receive data from the bus.
We also assume the existence of an equipotential region in the block.

However, as the number of registers and ports increases, the Data Decoder module
becomes bigger. Satisfying the above constraint can then become difficult. In
such situations, another version of the Basic Data Decoder module can be used.
Using this module eliminates the delay constraint altogether. This Basic Data
Decoder module has a 2-rail input, a 2-rail output and 2 acknowledge signals,
ACK1 and ACK2. Its input is connected to the bus and its output to the input of
a register or an output port. It is shown in Figure 4.16. On the application of
Control, the first data bit applied is passed to the dotted terminal of the
Toggle blocks. This also generates ACK1, which can be used as DdrAck or
PortDdrAck. The same is also fed back to the y inputs of C elements C1 and C2.
The next application of the data bit again causes C1 or C2 to create a
transition on its output. This, in turn, produces the data bit on the output
terminals. ACK2 is generated to signify completion of this operation. This
signal can be used to generate the FinalAck.
Figure 4.16 DI Basic Data Decoder Module


Module                      C Elements    XOR gates
Bus (Approach 1)            3031          11816
Bus (Approach 2-Cont 1)     2664          8078
Bus (Approach 2-Cont 2)     3576          7286
Bus (Approach 3)            2760          4814

Table 4.5 Circuit Complexity Comparison


This Basic Data Decoder module can easily be used in approaches 2 and 3 with
appropriate changes in the local bus controller. Use of this module leads to an
increase in the circuit complexity. However, the increase in the delays will be
marginal.

4.3.6 Comparison of the three approaches 

For the three approaches, the circuit complexity and the delays in Ri → Rj are
given in Table 4.5 and Table 4.6 respectively. They are calculated for 50
registers, 8 input ports and 16 output ports. The data is 8 bits wide.

Module                      C Elements    XOR gates
Bus (Approach 1)            9             8
Bus (Approach 2-Cont 1)     8             19
Bus (Approach 2-Cont 2)     13            12
Bus (Approach 3)            16            38

Table 4.6 Delays in Ri → Rj

It is seen that the attempt to reduce area has resulted in increased delays. The
major portion of the delay is contributed by the delays in the 1-rail or 2-rail
mergers, which have logarithmic complexity. A few data transfers which occur
more frequently can be given higher priority by applying the signals
corresponding to them at a later stage in the merger. This will make these data
transfers faster, at the expense of making other, infrequent data transfers
slower.
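The trade-off can be illustrated with a small calculation. The sketch below
assumes that a merger is built as a tree of 2-input stages with one unit of
delay per stage; this structure and the function names are assumptions made for
illustration, not details taken from the thesis.

```python
def balanced_depth(inputs):
    """Stages on every path of a balanced tree of 2-input merge elements."""
    # (inputs - 1).bit_length() equals ceil(log2(inputs)) for inputs >= 1.
    return (inputs - 1).bit_length()


def prioritized_depths(inputs):
    """One favoured input joins at the last stage; the rest form a balanced tree.

    Returns (stages seen by the favoured input, stages seen by the others).
    """
    return 1, balanced_depth(inputs - 1) + 1


n, pi = 50, 8                        # registers and input ports, as in Table 4.5
sources = n + pi                     # signals combined by the merger
print(balanced_depth(sources))       # 6
print(prioritized_depths(sources))   # (1, 7)
```

Under these assumptions the favoured transfer sees one merger stage instead of
six, while every other source sees seven.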

It is observed that, for approach 2, controller 1 gives better performance both in 
terms of number of transistors needed and speed. 

4.3.7 Pi → Pj Data transfer

As described before, for data transfers between datapath elements, the bus sees
the inputs and outputs of the datapath elements as output and input ports,
respectively. Therefore, the data transfer from one datapath element to the
other is implemented using Pi → Pj.

In approaches 1 and 2 for realising the bus structure, this can be achieved by
creating an event on the WPj wire. In approach 3, in addition to this, a signal
P2p_trf is needed, which is active when Pi → Pj is intended. This signal should
be disjunctively combined with the other signals which generate an event on the
x input of Cp in Figure 4.14a.
Chapter 5

Issues in Synthesis 


In this chapter, issues related to synthesizing asynchronous circuits from HDL
descriptions are considered. We assume that asynchronous circuits will be
designed using our approach. The issues related to synthesis in the synchronous
domain are well known. We refer the reader to [14, 15] for the synthesis of
synchronous digital circuits.

In trying to adapt the hand synthesized design style to the task of automated
synthesis from HDL descriptions, we observed that any asynchronous circuit
implementing a design can be represented using two distinct blocks, a controller
and a datapath, which interact as shown in Figure 5.1. Furthermore, a natural
hierarchy of controllers was seen to be present in most designs. The controller
is split into the main controller and the local controller embedded within the
datapath. The controller is split for the following reasons.

• The control information in the CDFG or CFG can be used directly to derive 
the main controller. 

• The local controller is derived from the DFG after scheduling and binding 
operations have been carried out on it. It also uses the information related 
to scheduling and binding to derive the specifications for the interconnection 
network. 

The main controller issues control signals C1, ..., Cn to the local controller
and receives a corresponding acknowledge for each of them. The datapath
generally consists of resources such as datapath elements and registers
interconnected through the interconnection network. Data transfer between the
resources is realised under the direction of the local controller using the
elements in the interconnection network. The datapath elements, in turn, can
have their own local controllers to internally coordinate the various events
necessary for carrying out their intended functionality.

Figure 5.1 Block diagram of the synthesized design

5.1 Synthesis Outline 

The HDL description is processed to get a Control Data Flow Graph (CDFG). The
CDFG is further divided into a Control Flow Graph (CFG) and a Data Flow Graph
(DFG). The CFG and DFG are then used to carry out the remaining synthesis
process. The synthesis task essentially consists of four distinct phases:

1. Scheduling

2. Resource sharing and binding

3. Interconnections and storage allocation

4. Control unit generation

Of these, the first two are similar to those in synchronous synthesis. However,
the remaining two steps differ significantly. We briefly describe each of the
phases below.

5.1.1 Scheduling 

In the synchronous design approach, scheduling is the partitioning of the design
behavior into control steps such that all operations in a control step execute
in one clock cycle. However, since there is no clock in asynchronous designs, an
alternate definition of the scheduling task is used, as given below [15].

Let the execution delays of the operations in a DFG be denoted by d1, ..., dn.
We define the start time of an operation as the time at which the operation
starts its execution. Let the start times of the operations be represented by
t1, ..., tn. Scheduling is the task of determining the start times, subject to
the precedence constraints specified by the DFG.
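As an illustration of this definition, the following Python sketch computes the
earliest start time of every operation when only the precedence constraints are
considered (an as-soon-as-possible schedule). The operation names, delays and
the dictionary-based DFG encoding are hypothetical and chosen only for the
example.

```python
def asap_start_times(delays, preds):
    """Earliest start time of each operation under precedence constraints.

    delays: {operation: execution delay}
    preds:  {operation: list of operations that must finish first}
    """
    start = {}

    def visit(op):
        if op not in start:
            # An operation may start once all its predecessors have finished.
            start[op] = max(
                (visit(p) + delays[p] for p in preds.get(op, [])), default=0
            )
        return start[op]

    for op in delays:
        visit(op)
    return start


# Hypothetical DFG: two multiplications feed an addition, which feeds another.
delays = {"m1": 2, "m2": 2, "a1": 1, "a2": 1}
preds = {"a1": ["m1", "m2"], "a2": ["a1"]}
schedule = asap_start_times(delays, preds)
```

Here m1 and m2 start at time 0, a1 at time 2 and a2 at time 3.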

Besides the functional behavior of any datapath element, we associate with it
additional parameters, viz., its area on silicon and its average and worst case
delays. In the design paradigm chosen, any datapath element can take an
arbitrary time to complete its operation correctly. However, scheduling the DFG
with respect to the above mentioned delays can result in an implementation with
maximum concurrency and sharability of resources, leading to improved
performance.

In the synchronous paradigm, the operations are scheduled in a number of
discrete and equal time steps, the time step being decided by the worst case
delay amongst the various functional units for unit step functional elements, or
the best case delay for multistep functional units. In the asynchronous domain,
there is no notion of a clock. However, the various operations can still be
scheduled over time steps. The maximum time step for any DFG is the minimum of
the individual delays of all the datapath elements. The smaller the time step,
the better the schedule obtained. However, there is no point in reducing the
time step below the minimum of the delays in the basic elements used, e.g., C
element, XOR gate and inverter. Also, one can use different time steps for
different DFGs in the same system.

With the above assumption, we can use the existing scheduling algorithms of the
synchronous domain. Scheduling algorithms in the synchronous domain have been
divided into two broad classes depending upon the constraint used to drive the
scheduling task.

Resource constrained scheduling: In this, there is an upper limit on the number
of datapath elements that can be used. The scheduling is done by assigning the
average case delays to the datapath elements instead of the worst case delays.
This is possible because, in the design methodology adopted, the circuit
implementation works correctly irrespective of the delays in the various circuit
blocks. Therefore, worst case delays need not be assigned to ensure correctness.
Using the average case delays increases the possibility of better performance.

Time constrained scheduling: In this, there is an upper limit on the execution
time of the CDFG. Therefore, the worst case delays must be assigned to the
datapath elements. Furthermore, the delays in the interconnection network and in
the controller should also be accounted for. Also, the assumption that any
circuit block can have an arbitrary execution time will no longer be valid.
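A resource constrained schedule can be sketched with a simple greedy list
scheduler that uses average case delays. This is one common heuristic from the
synchronous domain, not necessarily the algorithm intended by the thesis; the
operation names, unit types and delay values are hypothetical.

```python
def list_schedule(ops, preds, avg_delay, units):
    """Greedy list scheduling under a limit on datapath elements.

    ops:       {operation: unit type}
    preds:     {operation: list of predecessor operations}
    avg_delay: {unit type: average case delay}
    units:     {unit type: number of available elements}
    """
    start, finish = {}, {}
    free_at = {kind: [0] * count for kind, count in units.items()}
    remaining = set(ops)
    while remaining:
        best = None
        for op in sorted(remaining):
            if not all(p in finish for p in preds.get(op, [])):
                continue  # some predecessor is not scheduled yet
            data_ready = max((finish[p] for p in preds.get(op, [])), default=0)
            slots = free_at[ops[op]]
            unit = min(range(len(slots)), key=slots.__getitem__)
            t = max(data_ready, slots[unit])
            if best is None or t < best[0]:
                best = (t, op, unit)
        t, op, unit = best
        start[op] = t
        finish[op] = t + avg_delay[ops[op]]
        free_at[ops[op]][unit] = finish[op]
        remaining.remove(op)
    return start


ops = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "alu"}
preds = {"a1": ["m1", "m2"]}
avg_delay = {"mul": 2, "alu": 1}
one_mul = list_schedule(ops, preds, avg_delay, {"mul": 1, "alu": 1})
two_mul = list_schedule(ops, preds, avg_delay, {"mul": 2, "alu": 1})
```

With one multiplier the multiplications are serialized and a1 starts at time 4;
with two multipliers a1 can start at time 2.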


5.1.2 Resource Sharing and Binding 

This is the assignment of operations, memory accesses and interconnections from
the behavioral description to hardware units for optimal area and performance.
In our case, this task is mainly dictated by the fact that the output of each
datapath element and the data stored in each register are transferred to the
other resources through the corresponding interconnection elements in the
interconnection network. The area complexity and the delay associated with these
interconnection elements increase with the number of destinations, po, that each
source resource has.

Binding Objective: If resources are shared such that, over all n interconnection
elements, the sum of po (taken for i = 1, ..., n) is minimized, we get a
minimized interconnection network.
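The objective can be evaluated for a candidate binding as sketched below. The
sketch assumes that destinations are counted per resource rather than per input
terminal, and the DFG edges, operation names and resource names are
hypothetical.

```python
def interconnect_cost(edges, binding):
    """Sum, over all source resources, of the number of distinct destinations.

    edges:   list of (producer operation, consumer operation) pairs in the DFG
    binding: {operation: resource it is bound to}
    """
    destinations = {}
    for producer, consumer in edges:
        source = binding[producer]
        destinations.setdefault(source, set()).add(binding[consumer])
    return sum(len(dests) for dests in destinations.values())


edges = [("m1", "a1"), ("m2", "a1"), ("m3", "a2")]

# Binding A spreads the additions over two ALUs.
binding_a = {"m1": "MUL1", "m2": "MUL2", "m3": "MUL1", "a1": "ALU1", "a2": "ALU2"}
# Binding B shares a single ALU for both additions.
binding_b = {"m1": "MUL1", "m2": "MUL2", "m3": "MUL1", "a1": "ALU1", "a2": "ALU1"}

cost_a = interconnect_cost(edges, binding_a)
cost_b = interconnect_cost(edges, binding_b)
```

In this example binding A gives a total of 3 destinations and binding B gives 2,
so binding B is preferred under the stated objective.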
5.1.3 Interconnections and storage allocation

The resources can be interconnected using either the point to point topology or
the bus topology. In the point to point topology, the DFG obtained after
scheduling and binding is used to generate the interconnection network
consisting of the interconnection elements. In our case, the interconnection
elements can also serve as storage units. In the approach based on the bus
topology, the interconnection network topology is more generic and is generally
fixed a priori. Only the registers need to be allocated to store constants and
variables. In both cases, a set of rules is applied to ensure a hazard-free
implementation of the data transfers implied by the scheduling and binding done
with respect to a given DFG.

5.1.4 Control unit generation 

As stated before, the main controller is derived directly from the CFG, while
the local controller is obtained using the information derived in the above step
and the DFG. It is evident that the main controller derived in synthesis will be
identical for both interconnection topologies. The local controller for both
topologies carries out the data transfers between the various resources. Data
transfers in the point to point topology can be concurrent, but those on a
single bus structure will necessarily be sequential. This fact and the
implementation differences between the two topologies make the realisation of
the local controller different for the two topologies.

5.2 Synthesis for Point to Point Interconnection 
Topology 

Consider a differential equation integrator, the behavioral description of which
is given below.

while (x1 < a) do
    x1 = x + dx,
    u1 = u + 3xudx - 3ydx,
    y1 = y + udx,
    x = x1,
    y = y1,
    u = u1,
endwhile

Figure 5.2 shows the DFG after scheduling and binding have been performed. It
has 11 operational nodes. The implementation employs two multipliers and two
ALUs. The operations indicated by the DFG are executed iteratively. For the sake
of simplicity, it is assumed that the delays in the ALUs and the multipliers are
identical and equal to the time step used to carry out the scheduling task.

As can be seen, the data that each of the four datapath elements receives at its
inputs can be grouped into three classes:

• Output of any of the datapath elements (e.g., output of ALU1, etc.)

• Constants (e.g., ‘a’, ‘dx’ and ‘3’)

• Variables that are stored across iterations (e.g., ‘x’, ‘y’ and ‘u’)

The data belonging to all three classes can be stored in general purpose
registers implemented using UReg or REG1, from which it can then be transferred
to the appropriate destinations. However, it is seen that circuit complexity and
performance can be improved by using three different interconnection elements
for these classes of data. The output of each datapath element is directly
transferred to the required destinations through its corresponding A_Demux
element. This has been discussed in detail in the last chapter. As the A_Demux
also stores the output of the datapath element until it is transferred to the
destination, we call it ‘A_Demux_Store’. For constants, we use
‘Const_Demux_Store’, while for the third data type an element called
‘Iter_Var_Store’ is used. The implementation of these elements is given later.

Let the DFG consist of m vertices representing m operations. We assume that
these operations are carried out by n datapath elements, n <= m. Each datapath
element generates an acknowledge to indicate the completion of its operation.
Also, Iter_Var_Store generates an input acknowledge each time new data is
written into it by its source datapath elements. The control signals necessary
to coordinate the execution of the m operations and the data transfers in a DFG
are generated by the local controller using these acknowledge signals and the
control signals from the main controller. To carry out this task, the local
controller may also need a few input acknowledge signals from some of the
datapath elements. The input and output acknowledge signals for a 2-input,
single-output datapath element, generated while executing node K in a DFG, are
indicated by AK1, AK2 and AKO respectively, while the ith input acknowledge
signal for an Iter_Var_Store element R is indicated by ARi.

Figure 5.2 DFG of Differential Equation Integrator
5.2.1 Data Transfer Guidelines


We use the following notation for the discussions below. DPEi represents a
particular instance of a datapath element or a functional unit. By RESk we mean
the kth resource, where a resource signifies one of the following: a datapath
element, a storage unit or an interconnection unit.

Assume that a datapath element DPEi has performed its current operation and its
output is transferred to A_Demux_Storei. This data is to be transferred to one
of the inputs of the datapath element DPEj, to start a new operation Vj of DPEj.
Assume that the last data written into A_Demux_Storej needs to be transferred to
another resource RESk. Then the following is true.

1. If each output of DPEi is always transferred to the same input of DPEj and
this input of DPEj receives the output of DPEi only, then the corresponding
A_Demux_Storei can be replaced with wires.

2. Data from A_Demux_Storei is not transferred to DPEj unless the following
conditions are satisfied:

(a) DPEj has completed its previous operation. This condition needs to be
satisfied to avoid hazards at the corresponding input of DPEj. This is also true
for i = j.

(b) The previous output of DPEj has been transferred to RESk. If data is
transferred to DPEj from A_Demux_Storei without satisfying this condition, the
next output of DPEj may create hazards at the input of A_Demux_Storej.

An event on the output acknowledge Aj O of DP Ej , prior to the desired 
data transfer, signifies that the first condition can be met satisfactorily, 
while that for the second condition is signified by an event on the input 
acknowledge, A AT,, of RES^ 


   Special Cases

   (a) If the data in A_Demux_Store_i is to be transferred to DPE_i itself,
       then neither condition needs to be satisfied. In this case, there is no
       possibility of a hazard at all. However, if A_Demux_Store_i has multiple
       outputs, the output acknowledge is used to transfer the data back to
       DPE_i.

   (b) If the previous output of DPE_j is used by DPE_j itself, then only the
       first condition needs to be satisfied. In this case, DPE_j will receive
       data from A_Demux_Store_j at one input and from A_Demux_Store_i at its
       other input.

   (c) If data is being written into DPE_j for the first time in the current
       execution of the DFG, neither condition needs to be satisfied. However,
       as in the first special case, the output acknowledge A_iO of DPE_i is
       used to initiate the data transfer.

3. Deadlock condition. If RES_k is DPE_i, then a deadlock condition can occur.
   This is illustrated with an example. Figure 5.3 shows a DFG with four nodes.
   An adder and a multiplier are employed to execute these four operations.
   The figure also shows the A_Demux_Store blocks for the adder and the
   multiplier obtained by applying the rules stated above. It can be seen that
   two of the input acknowledge signals required by these rules, one of which
   is A_4I_1, are never generated, and a deadlock condition occurs. To avoid
   this, one of these two signals is not used.

4. If the current output of DPE_i is to be used later by any datapath element,
   after one or more further operations have been performed by DPE_i, it needs
   to be stored. Figure 5.4 shows such a situation in a DFG employing an adder
   and a multiplier. The modified A_Demux_Store for the multiplier is as shown
   in the figure.

5. If the output of DPE_i is to be transferred to an Iter_Var_Store, the
   corresponding signal to A_Demux_Store_i is not issued until the earlier data
   stored in that Iter_Var_Store has been cleared. The implementation of this
   is specific to the particular realisation employed for the Iter_Var_Store
   element.

6. The control signals to A_Demux_Store_i obtained by applying the above rules
   may occur concurrently if there is no temporal relation between them. This
   causes A_Demux_Store_i to malfunction, but it can be easily avoided as
   described next. Let the control signals C_p and C_q occur concurrently,
   where C_p and C_q transfer the output of DPE_i generated while executing
   V_p and V_q respectively. Then C_p should be combined with A_pO using a
   C element before it is applied to A_Demux_Store_i. In a similar way, C_q
   should be combined with A_qO. The control signals to the other
   interconnection elements, Const_Demux_Store and Iter_Var_Store, can be
   applied concurrently.

Figure 5.3: Deadlock condition
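The gating described in rule 6 can be illustrated with a small behavioural sketch. It assumes the usual behaviour of a Muller C element under 2-phase transition signalling: every change of level on a wire is an event, and the output changes only after both inputs have changed. The class name CElement and the method set_input are hypothetical names chosen for this illustration; they are not taken from the module library of this thesis.

```python
class CElement:
    """Behavioural sketch of a 2-input Muller C element on wire levels 0/1."""

    def __init__(self):
        self.inputs = [0, 0]
        self.output = 0

    def set_input(self, index, level):
        """Drive one input wire; return True if the output wire changed."""
        self.inputs[index] = level
        first, second = self.inputs
        if first == second and first != self.output:
            self.output = first  # both inputs agree, so the output follows
            return True
        return False  # inputs disagree, so the output holds its level


gate = CElement()
# Input 0 models the control signal C_p, input 1 the output acknowledge A_pO.
fired_on_control = gate.set_input(0, 1)  # event on C_p alone
fired_on_ack = gate.set_input(1, 1)      # event on A_pO completes the pair
print(fired_on_control, fired_on_ack, gate.output)
```

In this model the gated control signal reaches the A_Demux_Store only once the operation V_p has actually produced its output, which is the temporal relation that rule 6 imposes.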

5.2.2 The Local Controller 

For a DFG with m vertices, the local controller decodes the output acknowledges
of the n datapath elements into m signals A_1O to A_mO. This is done using n
Event Counters, as described next. Let a datapath element DPE_i execute the
operations in vertices V_p, V_q and V_r of the DFG. The output acknowledge
signal of DPE_i is applied at the Eventin input of a modulo-3 Event Counter to
derive A_pO, A_qO and A_rO at its outputs. If a few of the input acknowledges
A_kI_1 and/or A_kI_2, k = 1, ..., m, are needed, then we generate them from the
datapath elements by adding the necessary circuitry at these inputs. These
signals are then decoded using Event Counters to get the necessary signals.
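The decoding step can be sketched in a few lines of Python. The sketch assumes that an Event Counter steers successive events at its input to successive outputs in cyclic order, which is how it is used above; the class EventCounter, its method event_in and the output names are illustrative assumptions, not the module definition used in this thesis.

```python
class EventCounter:
    """Steers successive input events to outputs 0..modulo-1 in cyclic order."""

    def __init__(self, modulo):
        self.modulo = modulo
        self.next_output = 0
        self.levels = [0] * modulo  # current level of each output wire

    def event_in(self):
        """Consume one input event; toggle and return the selected output."""
        index = self.next_output
        self.levels[index] ^= 1  # a level change is an event in 2-phase NRZ
        self.next_output = (index + 1) % self.modulo
        return index


# DPE_i executes V_p, V_q and V_r in that order, so a modulo-3 counter
# decodes its single output acknowledge into A_pO, A_qO and A_rO.
names = ["A_pO", "A_qO", "A_rO"]
counter = EventCounter(3)
trace = [names[counter.event_in()] for _ in range(4)]
print(trace)
```

The fourth acknowledge wraps around to A_pO, corresponding to the first operation of the next execution of the DFG.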

The controller uses the above signals along with the control signals from the
main controller to generate control signals for the datapath elements and the
interconnection network. It also generates acknowledge signals for the main
controller using the same signals.

Figure 5.4: Modified A_Demux_Store

5.2.3 Interconnection Elements 

Depending on the three different classes of data that a datapath element
receives, three interconnection elements are needed to store and transfer the
data, viz., A_Demux_Store, Const_Demux_Store and Iter_Var_Store. Of these,
A_Demux_Store is nothing but the A_Demux described at the start of this section.
The remaining two elements are discussed here.

These elements are described with the help of the DFG shown in Figure 5.5a.
V_i, V_j and V_k are the three operational nodes in the DFG which receive the
data represented by DAT. Further, the following is assumed. These nodes are
executed by DPE_i, DPE_j and DPE_k respectively. The data DAT is transferred to
these datapath elements using the control signals C_i, C_j and C_k respectively.
The DFG is executed in iterations and each iteration is initiated by the
controller through an event on the Start_iteration control signal. Done is the
control signal indicating the end of the loop.

Figure 5.5a: Vertices in a DFG receiving the same data DAT

Const_Demux_Store

Let DAT be a constant used in the execution of a loop. Also, let all the
operations V_i, V_j and V_k be executed in each iteration. Then the
Const_Demux_Store element for DAT is as shown in Figure 5.5b. On each
application of Start_iteration, the data in the UReg is made available at its
Datastore terminals. This data can then be transferred to the desired datapath
elements using C_i, C_j or C_k. The application of the control signal Done
clears the data in the UReg by transferring it to the Out terminals. The same
task could have been achieved by replacing the Select blocks with an A_Demux
block. However, this results in greater circuit complexity. Also, in this
implementation, concurrent application of C_i, C_j and C_k is possible. This
may improve performance.

If the execution of V_j and V_k is conditional in individual iterations, then
the Const_Demux_Store is realised as shown in Figure 5.5c. Here, C_jk is the
control signal generated by the controller when C_j or C_k is to be applied.

Iter_Var_Store

Let DAT be a variable which is used across iterations and is also updated in
each iteration. Then the Iter_Var_Store corresponding to DAT is shown in
Figure 5.5d. An event on Start_iteration transfers the data in the U-gate to
the Select blocks, from which it can then be routed to the datapath elements.
After this, no data exists in the U-gate. As a result of the execution of the
DFG, a new data value is written into the U-gate, which is used in the next
iteration. Activation of the Done signal transfers the data updated and stored
in the U-gate in the last iteration to the Out terminals.

Figure 5.5b: Const_Demux_Store

Consider that the execution of V_j and V_k is conditional. The Iter_Var_Store
can have different realisations depending on how the variable DAT is used. Two
typical cases are discussed below.

• Let DAT be updated by the executions of V_i and either V_j or V_k. Figure 5.5e
  shows the Iter_Var_Store for this case.

• Let DAT be updated by the execution of V_i only. Then the corresponding
  implementation is as shown in Figure 5.5f. When the data is transferred to
  DPE_j and DPE_k, it should also be retained, so that it can be transferred
  to DPE_i when needed in the next iteration. Therefore, a UReg is employed.

The particular version of Iter_Var_Store needed for different cases can be
derived similarly for any DFG.


5.2.4 Examples 

We employ the above ideas to derive asynchronous implementations from the HDL
descriptions of the Shift Multiplier and the Differential Equation Integrator.

Figure 5.5c: Const_Demux_Store to support conditional execution of V_j and V_k

Shift Multiplier

In the Shift Multiplier, the multiplicand is represented by the variable B and
the multiplier by A. The final result is indicated by M.

The scheduled CDFG for the Shift Multiplier [14] is as shown in Figure 5.6.
The main controller is obtained directly using the control flow information in
the CDFG and is shown in Figure 5.7a. The loop consisting of four iterations is
implemented using the mod-5 Event Counter. The main controller generates the
control signals Add and Shift for the local controller to carry out the implied
functionality. It also issues the signal Done, which signifies the end of the
multiplication operation for a given set of input instances. It receives AddAck
and ShiftAck from the local controller.

The DFG implementing the multiplication operation has three vertices, V_1 to
V_3, as shown in Figure 5.7b. Let an adder and two shifters, SHR1 and SHR2,
be assigned to these operations. The interconnection elements A_Demux_Store_1
to A_Demux_Store_3 for these datapath elements are derived and are shown in
Figure 5.7c. Consider A_Demux_Store_1 and A_Demux_Store_2. The outputs of both
are transferred to the same destination, which stores M. Also, the two control
signals A_1O and A_2O can never occur concurrently. Therefore, a single
A_Demux_Store can be used instead of two, as shown in Figure 5.7e.

Figure 5.5d: Iter_Var_Store

It can be seen from the DFG that B is a constant and it is applied to the adder
once in each iteration. So the corresponding Const_Demux_Store takes the form
of a UReg. Variables A and M are needed across iterations. The corresponding
implementation of Iter_Var_Store for the variable M is as shown in Figure 5.7d,
while that for the variable A takes the form of a U-gate.

All the circuit blocks are interconnected to get the datapath shown in Figure
5.7e. It can be seen that the Start signal is used to initialize A, B and M.
Also, the Done signal generated by the main controller is used for transferring
the product available in the Iter_Var_Store modules corresponding to the
variables A and M. The local controller is derived as described before and is
shown in Figure 5.7f.

Differential Equation Integrator

The behavioral code and the DFG for the Differential Equation Integrator after
scheduling and binding are given earlier in Figure 5.2. The ALU used in this
example performs addition, subtraction and comparison. It has two 2-rail inputs
and three 1-rail control signals: Add, Sub and Comp. The result of addition or
subtraction is made available on its 2-rail output terminals, while the result
of the comparison operation is made available on one of its three 1-rail
outputs, namely a_ge_b, a_le_b and a_eq_b. The completion of each operation is
signified on its output AluOutAck. In addition to input, output and
output-acknowledge terminals, the multipliers also have a control signal
Multiply, which initiates the multiplication operation.

Figure 5.5e: Iter_Var_Store, Case 1

The synthesis process, as outlined below, starts with the derivation of the
main controller. The main controller is as shown in Figure 5.8a. Execution of
each new iteration is initiated by creating an event on the Start_iteration
control signal. The completion of execution of each iteration is signified by
an event on the ItCompAck wire. The main controller also receives a 2-rail
signal X_ge_A from the datapath. If the value of this signal is 1, the
controller initiates the next iteration.

The interconnection network is derived next. First, the A_Demux_Store elements
for the multipliers and the ALUs are obtained by applying the set of rules
described earlier. They are shown in Figure 5.8b along with the datapath
elements that they are connected to. The following can be observed.

• The output of multiplier MUL1 generated while executing either V_3 or V_7
  is transferred to the same input of ALU1. This results in a less complex
  A_Demux_Store element which has two outputs instead of three. The same is
  true for multiplier MUL2.

• Consider the A_Demux_Store element corresponding to MUL2. The control
  signals A_11O and A_4I_2 may occur concurrently. Therefore, each of them is
  combined, using a C element, with the output acknowledge of the MUL2
  operation whose result it transfers, as described in Section 5.2.1.

Figure 5.5f: Iter_Var_Store, Case 2


The Const_Demux_Store interconnection elements for the constants a, dx and 3
are shown in Figure 5.8c. The realisation of all of them is very similar to
that shown earlier in Figure 5.5b.

Figure 5.8c shows the Iter_Var_Store elements for all three variables x, y and
u. Their realisation is similar to that shown earlier in Figure 5.5d.

Finally, we describe the local controller implementation. It is shown in Figure
5.8d. It employs Event Counters to generate almost all the control signals
needed for the interconnection network. It also generates the control signals
Add, Sub and Comp for the ALUs and Multiply for the multipliers. The signal
ItCompAck is generated using the input acknowledge signals of the three
Iter_Var_Store elements corresponding to the variables x, y and u. The signal
X_ge_A, which is sent to the main controller, is derived using the 1-rail
outputs of ALU2.


5.3 Synthesis for Bus Interconnection Topology 

As the number of resources in a digital system amongst which data must be
transferred increases, the point to point interconnection topology may result
in increased cost of realisation in terms of area, and can also degrade
performance due to increased delays. To reduce this area cost, we can instead
realise these data transfers using the bus topology.

In the discussion that follows, the design process and the synthesis issues
based on the bus topology are highlighted. The basic idea is illustrated through
three examples, wherein we also give methods of improving the performance of
the bus topology. The first design is that of a Shift Multiplier. The second
design is the same multiplier implemented using a combination of the point to
point topology and the bus topology. Here we highlight the use of a multi-bus
structure. The final design is that of a Differential Equation Integrator.
Through this example, we illustrate an efficient method of using the bus
structure by appropriately ordering the data transfers. We wish to make it
clear that, for these examples, the point to point topology results in better
implementations in terms of both area and performance. However, our focus here
is on the design process and the synthesis issues.

5.3.1 Example 1 : Shift Multiplier 

We consider the same shift multiplier discussed earlier for the point to point
interconnection topology, and give its implementation using the bus topology.
As before, the resources needed to implement the shift multiplier are an adder,
a shifter and registers to store the variables A, B and M. Figure 5.9a shows
these resources symbolically connected to the bus. The actual circuitry and the
local bus controller are not shown for the sake of simplicity.

Unlike registers, the datapath elements do not contain any data when they are
idle. Therefore, we do not need to clear them before transferring data to them.
The data is transferred to their inputs using either the R_i -> P_j or the
P_i -> P_j mode of data transfer. Their inputs can, therefore, be treated as
output ports. Hence, these are named P_ADD_I1, P_ADD_I2, etc. The output of
each datapath element is stored in an associated Select block. On the
application of an event on its single-rail control input, the stored data is
transferred to its output. These outputs are connected to the bus. The data
from the Select blocks is transferred to the destination registers using the
P_i -> R_j data transfer mode. Therefore, these outputs act as input ports for
the bus structure and are labeled accordingly, e.g., P_ADD_OUT, P_SH_OUT1, etc.
The completion of the operation performed by each datapath element is signified
by an event on its Ack output, e.g., AdderAck, Shr1Ack, etc. After completion
of the multiplication, the product is sent out to the output ports P_out_lsb
and P_out_msb.

Main controller   Local controller   Operation
signals           signals
Start             Start1             A_Port -> A_Reg
                  Start2             B_Port -> B_Reg
                  Start3             0 -> M_Reg
Add               Add1               M_Reg -> P_ADD_I1
                  Add2               B_Reg -> P_ADD_I2
                  Add3               P_ADD_OUT -> M_Reg
Shift             Shift1             M_Reg -> P_SH_I1
                  Shift2             A_Reg -> P_SH_I2
                  Shift3             P_SH_OUT1 -> M_Reg
                  Shift4             P_SH_OUT2 -> A_Reg
Done              Done1              M_Reg -> P_out_msb
                  Done2              A_Reg -> P_out_lsb

Table 5.1: Data transfers in Single-bus Shift Multiplier

The main controller derived from the CFG is the same as that of the shift
multiplier designed before, except for the inclusion of two more C elements, as
shown in Figure 5.9b. These C elements are included to acknowledge the
completion of the two sets of data transfers initiated by the Start and the
Done signals respectively, as shown in Table 5.1. In this table, each control
signal C_i issued by the main controller is used by the local controller to
invoke a sequence of data transfers that implements the functionality implied
by C_i. Let n data transfers be needed to achieve this. Then the local
controller generates n control signals sequentially. Except for the first, each
of them is generated only after an appropriate acknowledge is received from the
bus.
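The sequencing rule can be sketched as follows. This is only a software analogy under stated assumptions: the function run_transfer_sequence and the callback bus_transfer are hypothetical names, the bus is reduced to a callback that returns True once it has acknowledged a transfer, and the output acknowledges of the datapath elements that the real local controller also waits for are ignored. The three transfers listed are those invoked by Add, as read from Table 5.1.

```python
def run_transfer_sequence(name, transfers, bus_transfer):
    """Issue one local control signal per transfer, each after the previous ack."""
    issued = []
    for k, transfer in enumerate(transfers, start=1):
        issued.append(f"{name}{k}")  # issue the k-th local control signal
        acknowledged = bus_transfer(transfer)
        if not acknowledged:
            return issued, False  # stall: no further signal is generated
    return issued, True  # all transfers done: acknowledge the main controller


# Transfers invoked by the main controller signal Add (from Table 5.1).
ADD_TRANSFERS = ["M_Reg -> P_ADD_I1", "B_Reg -> P_ADD_I2", "P_ADD_OUT -> M_Reg"]

completed = []


def ideal_bus(transfer):
    """Stand-in for the bus: records the transfer and acknowledges it."""
    completed.append(transfer)
    return True


issued, add_ack = run_transfer_sequence("Add", ADD_TRANSFERS, ideal_bus)
print(issued, add_ack)
```

If the bus never acknowledges a transfer, the sequence simply stops at that point, which mirrors the fact that the controller is driven by events rather than by a clock.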

Table 5.1 shows the signals issued by the local controller in response to each
control signal sent by the main controller, and the data transfer to be
implemented using each of them. Two examples of data transfer implementation
follow.

• Control signal Start2 should be used as a control signal for the 1-2 Converter
  associated with port B. The same signal should be used to generate an event
  on the appropriate destination control wire of the bus structure, which in
  turn generates the Clr input of B_Reg to clear it.

• Similarly, Add1 is used to generate an event on the WS control signal
  associated with M_Reg and on the WP control signal associated with P_SH_I1.
  The bus control signals can be generated from those of the local controller
  using XOR gates, e.g., the WS signal corresponding to A_Reg is derived by
  disjunctively combining the Shift2 and the Done2 control signals.
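The use of an XOR gate to combine control signals can be illustrated with a short sketch. Under 2-phase transition signalling an event is a change of level, so an XOR gate passes an event from either input to its output, provided the two inputs never change at the same time. The helper count_events and the variable names below are hypothetical and serve only this illustration.

```python
def count_events(levels):
    """Count level changes (events) in a sequence of wire levels."""
    return sum(1 for prev, cur in zip(levels, levels[1:]) if prev != cur)


shift2, done2 = 0, 0
ws_trace = [shift2 ^ done2]  # level of the merged WS wire for A_Reg

# Two Shift2 events followed by one Done2 event, never concurrently.
for source in ["Shift2", "Shift2", "Done2"]:
    if source == "Shift2":
        shift2 ^= 1
    else:
        done2 ^= 1
    ws_trace.append(shift2 ^ done2)  # XOR gate output after this event

print(ws_trace, count_events(ws_trace))
```

Each of the three input events appears as exactly one event on the merged wire, which is why a plain XOR gate suffices as long as the local controller issues the two signals at different times.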

Figures 5.9c and 5.9d show how the local controller generates the control
signals listed in column 2 of Table 5.1. It uses the signals from the main
controller, the acknowledge signals signifying the completion of individual
data transfers and the output acknowledge signals of the datapath elements. A
single FinalAck is decoded into 12 different acknowledge signals by the local
controller, as shown in Figure 5.9c. Of these, 4 signals are passed to the main
controller and the remaining signals are used by the local controller. A Control
Decoder and several Event Counters are employed to generate these signals, as
shown in Figure 5.9d.

5.3.2 Example 2 : Shift Multiplier : Mixed Approach 

The datapath is as shown in Figure 5.10a. In this example, A_Reg, one input
of the shifter (P_SH_I2), one output of the shifter (P_SH_OUT2) and the output
port P_out_lsb are connected to Bus-1, while the ports P_SH_I1, P_ADD_I1,
P_out_msb, P_ADD_OUT, P_SH_OUT1 and M_Reg are connected using Bus-2. B_Reg is
directly connected to the second input of the adder. This is an instance of a
point to point interconnection. The main controller remains the same as in the
previous example. The local controller receives the control signals from the
main controller and generates its own signals to carry out the necessary data
transfers, as shown in Tables 5.2a and 5.2b. As can be seen from the tables,
many data transfers are concurrent, thus increasing the speed of execution.

Main con.   Local con.   Operation on
signals     signals      Bus 1                   Bus 2
Start       Start        A_Port -> A_Reg         0 -> M_Reg
Add         Add1         -                       M_Reg -> P_ADD_I1
            Add2         -                       P_ADD_OUT -> M_Reg
Shift       Shift1       A_Reg -> P_SH_I2        M_Reg -> P_SH_I1
            Shift2       P_SH_OUT2 -> A_Reg      P_SH_OUT1 -> M_Reg
Done        Done         A_Reg -> P_out_lsb      M_Reg -> P_out_msb

Table 5.2a: Data transfers in Multi-bus Shift Multiplier

Main controller   Local controller   Operation
signals           signals            (non-bus)
Start             Start              B_Port -> B_Reg
Add               Add1               B_Reg -> P_ADD_I2
                  Add2               -
Shift             Shift1             -
                  Shift2             -
Done              Done               -

Table 5.2b: Data transfers in Multi-bus Shift Multiplier

Time Step   Operation   Dependencies
1           V1          -
            V2          -
            V10         -
2           V3          V1, V2
            V6          V2, V3
            V11         V10
3           V4          V3
            V7          V4, V6
            V8          V6, V7
4           V5          V4, V7
            V9          V8, V11

Table 5.3: Operation dependencies in Differential Equation Integrator


5.3.3 Example 3 : Differential Equation Integrator 

Consider the DFG scheduled in four time steps as shown in Figure 5.2. Let all
four datapath elements and the six registers used for storing the three
constants and three variables be connected to a single bus. Further, assume
that all eleven operations in the DFG take the same amount of time to execute.
In such a situation, the execution of the operations corresponding to any time
step can be ordered to improve performance.

Table 5.3 shows the data dependency of each operation on the other operations
in the DFG. These dependencies can be easily determined from the DFG. Assume
that a datapath element DPE_m has completed its current operation V_i. Let the
result of V_i be used to initiate another operation V_j. Also assume that the
next operation to be executed by DPE_m is V_k. Then a temporal constraint needs
to be imposed on the execution of V_k, which cannot begin until V_i has
transferred its output to V_j.

Ordering of Operations

1. The various operations corresponding to a single time step are first ordered
   according to the dependencies that they have to follow. Therefore, in time
   step 2, V3 should be executed before V6. Similarly, in time step 3, the
   ordered sequence of executions is V4, V7 and then V8.

2. Using the ordering carried out in step 1, the remaining operations can be
   ordered to improve performance.

   (a) As V4 is executed before V8 in time step 3, V5 should be executed before
       V9 in time step 4.

   (b) Execution of V7 depends on V6. Therefore, V6 is executed before V11 in
       time step 2.

   (c) Execution of V3 in time step 2 depends on the execution of V1 and V2.
       Therefore, in time step 1, V1 and V2 are executed before V10. It can be
       noted that V1 and V2 can be executed in any order.
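Step 1 of this procedure can be sketched in Python. The sketch covers only the first step: within one time step, an operation is placed after every operation of the same time step that it depends on. The dependency sets are as read from Table 5.3; the function order_within_step and the dictionary DEPENDS_ON are hypothetical names, and the performance-driven refinements of step 2 are not modelled.

```python
# Dependencies of each operation, as read from Table 5.3.
DEPENDS_ON = {
    "V1": [], "V2": [], "V10": [],
    "V3": ["V1", "V2"], "V6": ["V2", "V3"], "V11": ["V10"],
    "V4": ["V3"], "V7": ["V4", "V6"], "V8": ["V6", "V7"],
    "V5": ["V4", "V7"], "V9": ["V8", "V11"],
}


def order_within_step(operations, depends_on):
    """Order the operations of one time step so dependencies come first."""
    remaining = list(operations)
    ordered = []
    while remaining:
        for op in remaining:
            # Only dependencies still waiting in this time step block op.
            blocked = any(dep in remaining for dep in depends_on[op])
            if not blocked:
                ordered.append(op)
                remaining.remove(op)
                break
        else:
            raise ValueError("cyclic dependency within a time step")
    return ordered


step3 = order_within_step(["V8", "V7", "V4"], DEPENDS_ON)
step2 = order_within_step(["V6", "V11", "V3"], DEPENDS_ON)
print(step3, step2)
```

For time step 3 the sketch yields V4, V7, V8 regardless of the input order, and for time step 2 it places V3 before V6. The position of V11 relative to V6 is left open by step 1 and is fixed only by rule 2(b).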
Figure 5.6: CDFG of the Shift Multiplier

Figure 5.7b: DFG of the Shift Multiplier

Figure 5.9a: Shift Multiplier (Single-bus) Datapath

Figure 5.10a: Shift

Chapter 6

Conclusion and Future Work 


Conclusion : 

The design of basic modules to systematically realise and synthesize
implementations of asynchronous systems using the 2-phase NRZ transition
signalling protocol has been given. The use of these modules has been
illustrated with three design examples, viz., the Shift Multiplier, the
Polynomial Serial Parallel Multiplier and the Counter. Interconnection schemes
based on both the point to point topology and the bus structure have been
given. Three approaches to implementing the bus structure have been described.
Finally, synthesis issues have been discussed. The synthesis procedure outlined
has been illustrated through two examples, viz., the Shift Multiplier and the
Differential Equation Integrator.

Future Work : 

In the synthesis approach based on the bus topology, all data transfers needed
to execute a set of operations have to be implicitly sequential. This can work
well for smaller designs. However, for larger designs distributed over many
chips, it can be a major restriction. Synthesis for such designs should allow
data transfer between different circuit blocks on different chips as and when
they are ready for the data transfer to take place. This necessitates the use
of arbitration to resolve conflicts arising out of concurrent requests for any
common resource, such as a bus. To take care of this, we need to develop
suitable arbitration logic based on our signalling protocol.

Three approaches have been discussed in [9, 10] to realise Boolean expressions
using U gates. None of these can be shown to lead to optimal expressions. A
multilevel logic optimization method needs to be formulated to exploit the basic
property of the U gate, namely that it generates all the 2^n minterms with
respect to n inputs.

Appendix A


Simulations - SPICE and
VERILOG


A detailed simulation using SPICE, at the transistor level, has been carried out
for the C element and the XOR gate. Using typical process parameters for a 1 µm
technology, the average case delays from the SPICE simulations with respect to
a 100 fF load capacitor (50 unit loads, where 1 unit load = 2 fF) for the
C element and XOR gates implemented using minimum feature size transistors are
found to be 1.8 nS and 1.1 nS respectively. Similarly, the typical delay for a
UReg based on SPICE simulations is 3 nS.

The delays for the basic elements obtained through SPICE are used in the Verilog
model for any design implementation. Specifically, simulations in Verilog have
been carried out using the above delays for the structural models of the
implemented designs. All the basic modules and the designs described in this
thesis have been simulated in Verilog, except for the designs described in the
last section of Chapter 5, i.e., the design examples based on the bus
interconnection topology.

Some of the simulation results are listed below.

1. For the 4-bit Shift Multiplier implemented using the point to point
   interconnection topology, the worst case delay (multiplying decimal 15 by
   15) is found to be approximately 300 nS, while the best case delay
   (multiplying decimal 15 by 0) is found to be approximately 136 nS. The worst
   case and the best case here refer to data dependent delays, not to
   parametric delays. The corresponding synchronous implementation will take 14
   clock cycles to execute the multiplication for all the possible cases.

2. The Differential Equation Integrator has been simulated using the behavioral
   modules of the ALU and the 4-bit multiplier. A delay of 25 nS was assigned
   to the ALU and a delay of 200 nS to the multiplier. The delay in executing a
   single iteration is found to be 685 nS.

3. The Polynomial Serial Parallel Multiplier has a delay of approximately 50 nS
   for each serial output bit produced, i.e., the PSPM produces about one
   output bit every 50 nS.

The behavioral and the structural descriptions in Verilog have been created for
all the modules and can be invoked as basic library elements for creating new
design implementations and simulating them.

While simulating every design implementation, a test for delay-insensitivity
was conducted by assigning arbitrarily large and random delays to different
modules and interconnections. In every case, the corresponding implementation
worked correctly, which supports the claim that this methodology is truly
delay-insensitive.